A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge, or preferably at a low cost, which is beneficial to guests but a less desirable and possibly revenue-diminishing factor for hotels. Such losses are particularly high for last-minute cancellations.
New technologies involving online booking channels have dramatically changed customers' booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer driven solely by traditional booking and guest characteristics.
The cancellation of bookings impacts a hotel on various fronts.
The increasing number of cancellations calls for a machine learning solution that can predict which bookings are likely to be canceled. INN Hotels Group, a chain of hotels in Portugal, is facing problems with a high number of booking cancellations and has reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings will be canceled, and help formulate profitable policies for cancellations and refunds.
# Data Description
The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.
Data Dictionary
Booking_ID: the unique identifier of each booking
no_of_adults: Number of adults
no_of_children: Number of Children
no_of_weekend_nights: Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel
no_of_week_nights: Number of weeknights (Monday to Friday) the guest stayed or booked to stay at the hotel
type_of_meal_plan: Type of meal plan booked by the customer:
Not Selected – No meal plan selected
Meal Plan 1 – Breakfast
Meal Plan 2 – Half board (breakfast and one other meal)
Meal Plan 3 – Full board (breakfast, lunch, and dinner)
required_car_parking_space: Does the customer require a car parking space? (0 - No, 1- Yes)
room_type_reserved: Type of room reserved by the customer. The values are ciphered (encoded) by INN Hotels Group
lead_time: Number of days between the date of booking and the arrival date
arrival_year: Year of arrival date
arrival_month: Month of arrival date
arrival_date: Day of the month of arrival
market_segment_type: Market segment designation.
repeated_guest: Is the customer a repeated guest? (0 - No, 1- Yes)
no_of_previous_cancellations: Number of previous bookings that were canceled by the customer prior to the current booking
no_of_previous_bookings_not_canceled: Number of previous bookings not canceled by the customer prior to the current booking
avg_price_per_room: Average price per day of the reservation; prices of the rooms are dynamic. (in euros)
no_of_special_requests: Total number of special requests made by the customer (e.g., high floor, a room with a view)
booking_status: Flag indicating if the booking was canceled or not.
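Since booking_status is a text flag, a predictive model will need it as a numeric target. A minimal sketch of the encoding, using a small toy DataFrame as a stand-in for the real file (the toy values are illustrative only):

```python
import pandas as pd

# Toy stand-in for the booking_status column (the real file has 36,275 rows)
toy = pd.DataFrame({"booking_status": ["Not_Canceled", "Canceled", "Canceled"]})

# Map the flag to a binary target: 1 = canceled, 0 = not canceled
toy["is_canceled"] = (toy["booking_status"] == "Canceled").astype(int)
print(toy["is_canceled"].tolist())  # [0, 1, 1]
```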
from google.colab import files
uploaded = files.upload()
Saving INNHotelsGroup.csv to INNHotelsGroup (1).csv
import numpy as np
import pandas as pd
# import libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
import io
df = pd.read_csv(io.BytesIO(uploaded['INNHotelsGroup.csv']))
print(df)
Booking_ID no_of_adults no_of_children no_of_weekend_nights \
0 INN00001 2 0 1
1 INN00002 2 0 2
2 INN00003 1 0 2
3 INN00004 2 0 0
4 INN00005 2 0 1
... ... ... ... ...
36270 INN36271 3 0 2
36271 INN36272 2 0 1
36272 INN36273 2 0 2
36273 INN36274 2 0 0
36274 INN36275 2 0 1
no_of_week_nights type_of_meal_plan required_car_parking_space \
0 2 Meal Plan 1 0
1 3 Not Selected 0
2 1 Meal Plan 1 0
3 2 Meal Plan 1 0
4 1 Not Selected 0
... ... ... ...
36270 6 Meal Plan 1 0
36271 3 Meal Plan 1 0
36272 6 Meal Plan 1 0
36273 3 Not Selected 0
36274 2 Meal Plan 1 0
room_type_reserved lead_time arrival_year arrival_month \
0 Room_Type 1 224 2017 10
1 Room_Type 1 5 2018 11
2 Room_Type 1 1 2018 2
3 Room_Type 1 211 2018 5
4 Room_Type 1 48 2018 4
... ... ... ... ...
36270 Room_Type 4 85 2018 8
36271 Room_Type 1 228 2018 10
36272 Room_Type 1 148 2018 7
36273 Room_Type 1 63 2018 4
36274 Room_Type 1 207 2018 12
arrival_date market_segment_type repeated_guest \
0 2 Offline 0
1 6 Online 0
2 28 Online 0
3 20 Online 0
4 11 Online 0
... ... ... ...
36270 3 Online 0
36271 17 Online 0
36272 1 Online 0
36273 21 Online 0
36274 30 Offline 0
no_of_previous_cancellations no_of_previous_bookings_not_canceled \
0 0 0
1 0 0
2 0 0
3 0 0
4 0 0
... ... ...
36270 0 0
36271 0 0
36272 0 0
36273 0 0
36274 0 0
avg_price_per_room no_of_special_requests booking_status
0 65.00 0 Not_Canceled
1 106.68 1 Not_Canceled
2 60.00 0 Canceled
3 100.00 0 Canceled
4 94.50 0 Canceled
... ... ... ...
36270 167.80 1 Not_Canceled
36271 90.95 2 Canceled
36272 98.39 2 Not_Canceled
36273 94.50 0 Canceled
36274 161.67 0 Not_Canceled
[36275 rows x 19 columns]
df.head()
|   | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | type_of_meal_plan | required_car_parking_space | room_type_reserved | lead_time | arrival_year | arrival_month | arrival_date | market_segment_type | repeated_guest | no_of_previous_cancellations | no_of_previous_bookings_not_canceled | avg_price_per_room | no_of_special_requests | booking_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | INN00001 | 2 | 0 | 1 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 224 | 2017 | 10 | 2 | Offline | 0 | 0 | 0 | 65.00 | 0 | Not_Canceled |
| 1 | INN00002 | 2 | 0 | 2 | 3 | Not Selected | 0 | Room_Type 1 | 5 | 2018 | 11 | 6 | Online | 0 | 0 | 0 | 106.68 | 1 | Not_Canceled |
| 2 | INN00003 | 1 | 0 | 2 | 1 | Meal Plan 1 | 0 | Room_Type 1 | 1 | 2018 | 2 | 28 | Online | 0 | 0 | 0 | 60.00 | 0 | Canceled |
| 3 | INN00004 | 2 | 0 | 0 | 2 | Meal Plan 1 | 0 | Room_Type 1 | 211 | 2018 | 5 | 20 | Online | 0 | 0 | 0 | 100.00 | 0 | Canceled |
| 4 | INN00005 | 2 | 0 | 1 | 1 | Not Selected | 0 | Room_Type 1 | 48 | 2018 | 4 | 11 | Online | 0 | 0 | 0 | 94.50 | 0 | Canceled |
Observation:
The DataFrame has 36275 rows and 19 columns, matching the attributes listed in the Data Dictionary.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36275 entries, 0 to 36274
Data columns (total 19 columns):
 #   Column                                Non-Null Count  Dtype
---  ------                                --------------  -----
 0   Booking_ID                            36275 non-null  object
 1   no_of_adults                          36275 non-null  int64
 2   no_of_children                        36275 non-null  int64
 3   no_of_weekend_nights                  36275 non-null  int64
 4   no_of_week_nights                     36275 non-null  int64
 5   type_of_meal_plan                     36275 non-null  object
 6   required_car_parking_space            36275 non-null  int64
 7   room_type_reserved                    36275 non-null  object
 8   lead_time                             36275 non-null  int64
 9   arrival_year                          36275 non-null  int64
 10  arrival_month                         36275 non-null  int64
 11  arrival_date                          36275 non-null  int64
 12  market_segment_type                   36275 non-null  object
 13  repeated_guest                        36275 non-null  int64
 14  no_of_previous_cancellations          36275 non-null  int64
 15  no_of_previous_bookings_not_canceled  36275 non-null  int64
 16  avg_price_per_room                    36275 non-null  float64
 17  no_of_special_requests                36275 non-null  int64
 18  booking_status                        36275 non-null  object
dtypes: float64(1), int64(13), object(5)
memory usage: 5.3+ MB
Observation
There are 36275 non-null observations in each of the columns.
Of the 19 columns, 5 are object variables and 14 are numeric (13 integer, 1 float).
The DataFrame uses about 5.3+ MB of memory.
df.shape
(36275, 19)
missing_values = pd.isnull(df)
print(missing_values)
Booking_ID no_of_adults no_of_children no_of_weekend_nights \
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
... ... ... ... ...
36270 False False False False
36271 False False False False
36272 False False False False
36273 False False False False
36274 False False False False
no_of_week_nights type_of_meal_plan required_car_parking_space \
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
... ... ... ...
36270 False False False
36271 False False False
36272 False False False
36273 False False False
36274 False False False
room_type_reserved lead_time arrival_year arrival_month \
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
... ... ... ... ...
36270 False False False False
36271 False False False False
36272 False False False False
36273 False False False False
36274 False False False False
arrival_date market_segment_type repeated_guest \
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
... ... ... ...
36270 False False False
36271 False False False
36272 False False False
36273 False False False
36274 False False False
no_of_previous_cancellations no_of_previous_bookings_not_canceled \
0 False False
1 False False
2 False False
3 False False
4 False False
... ... ...
36270 False False
36271 False False
36272 False False
36273 False False
36274 False False
avg_price_per_room no_of_special_requests booking_status
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
... ... ... ...
36270 False False False
36271 False False False
36272 False False False
36273 False False False
36274 False False False
[36275 rows x 19 columns]
df.isnull().sum()
Booking_ID                              0
no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64
This confirms that there are no missing values in any of the columns.
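Printing the full boolean frame is verbose; the same check can be reduced to one total and a per-column share. A minimal sketch on a toy frame with one deliberately missing value (the toy data is an assumption, used only to show the output shape):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing lead_time to illustrate the summary
toy = pd.DataFrame({"lead_time": [224, 5, np.nan], "no_of_adults": [2, 2, 1]})

# One number: total missing cells across the whole frame
total_missing = int(toy.isna().sum().sum())
print(total_missing)  # 1

# Per-column share of missing values (0.0 everywhere on the real data)
print(toy.isna().mean())
```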
summary = df.describe()
print(summary)
no_of_adults no_of_children no_of_weekend_nights no_of_week_nights \
count 36275.000000 36275.000000 36275.000000 36275.000000
mean 1.844962 0.105279 0.810724 2.204300
std 0.518715 0.402648 0.870644 1.410905
min 0.000000 0.000000 0.000000 0.000000
25% 2.000000 0.000000 0.000000 1.000000
50% 2.000000 0.000000 1.000000 2.000000
75% 2.000000 0.000000 2.000000 3.000000
max 4.000000 10.000000 7.000000 17.000000
required_car_parking_space lead_time arrival_year arrival_month \
count 36275.000000 36275.000000 36275.000000 36275.000000
mean 0.030986 85.232557 2017.820427 7.423653
std 0.173281 85.930817 0.383836 3.069894
min 0.000000 0.000000 2017.000000 1.000000
25% 0.000000 17.000000 2018.000000 5.000000
50% 0.000000 57.000000 2018.000000 8.000000
75% 0.000000 126.000000 2018.000000 10.000000
max 1.000000 443.000000 2018.000000 12.000000
arrival_date repeated_guest no_of_previous_cancellations \
count 36275.000000 36275.000000 36275.000000
mean 15.596995 0.025637 0.023349
std 8.740447 0.158053 0.368331
min 1.000000 0.000000 0.000000
25% 8.000000 0.000000 0.000000
50% 16.000000 0.000000 0.000000
75% 23.000000 0.000000 0.000000
max 31.000000 1.000000 13.000000
no_of_previous_bookings_not_canceled avg_price_per_room \
count 36275.000000 36275.000000
mean 0.153411 103.423539
std 1.754171 35.089424
min 0.000000 0.000000
25% 0.000000 80.300000
50% 0.000000 99.450000
75% 0.000000 120.000000
max 58.000000 540.000000
no_of_special_requests
count 36275.000000
mean 0.619655
std 0.786236
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 5.000000
Number of Adults:
The average number of adults per booking is approximately 1.84.
The majority of bookings have 2 adults, as indicated by the median (50th percentile) and the 25th and 75th percentiles all being 2.
The range is between 0 and 4 adults per booking, with a standard deviation of 0.52, indicating relatively little variation.
Number of Children:
The average number of children per booking is 0.11.
The median and the 25th and 75th percentiles all being 0 suggest that most bookings do not include children.
The number of children per booking ranges from 0 to 10, but bookings with children are relatively uncommon, as indicated by the low mean and the small standard deviation of 0.40.
Number of Weekend Nights:
On average, guests book for about 0.81 weekend nights.
The median is 1, with the 25th percentile at 0 and the 75th percentile at 2, indicating a common booking pattern of 1 or 2 weekend nights.
The range is from 0 to 7 weekend nights, with a standard deviation of 0.87, indicating some variability.
Number of Week Nights:
The average number of weeknights booked is 2.20.
The median is 2, with the 25th percentile at 1 and the 75th percentile at 3, suggesting most bookings are for 1 to 3 weeknights.
The range is from 0 to 17 weeknights, with a standard deviation of 1.41, indicating moderate variability.
Required Car Parking Space:
Only about 3% of bookings require a car parking space, as indicated by the mean of 0.03.
The majority of bookings do not require a parking space, as shown by the 0 value for the 25th, 50th, and 75th percentiles.
Lead Time:
The average lead time for bookings is approximately 85 days.
The median lead time is 57 days, with the 25th percentile at 17 days and the 75th percentile at 126 days, indicating a wide range of lead times.
The range is from 0 to 443 days, with a standard deviation of 85.93, showing significant variability.
Arrival Year:
The data spans arrivals in 2017 and 2018; the mean year being close to 2018 indicates that most bookings are for 2018.
Arrival Month:
Bookings are spread throughout the year, with a slight peak around the middle of the year, as the mean month is approximately 7.42.
The median is 8, with the 25th percentile at 5 and the 75th percentile at 10, indicating a fairly even distribution of bookings across different months.
Arrival Date:
The average arrival date is around the 15th of the month.
The range of arrival dates is from 1 to 31, indicating bookings occur throughout the entire month.
Repeated Guest:
Only about 2.56% of guests are repeat visitors, as indicated by the mean. The majority of guests are first-time visitors, as shown by the 0 value for the 25th, 50th, and 75th percentiles.
Number of Previous Cancellations:
The average number of previous cancellations per guest is very low (0.02). The majority of guests have no prior cancellations, as indicated by the 0 value for the 25th, 50th, and 75th percentiles. However, there are some guests with a significant number of previous cancellations, as the maximum value is 13.
Number of Previous Bookings Not Canceled:
The average number of previous bookings not canceled is 0.15. Most guests have no previous bookings that were not canceled, as shown by the 0 value for the 25th, 50th, and 75th percentiles. The maximum value is 58, indicating that some guests have a high number of previous successful bookings.
Average Price Per Room:
The average price per room is approximately €103.42 (prices are in euros, per the data dictionary). The median price is €99.45, with the 25th percentile at €80.30 and the 75th percentile at €120.00, indicating a moderate central range for room prices. Prices range from €0 to €540, with a standard deviation of 35.09, indicating significant variability in room pricing.
Number of Special Requests:
On average, guests make about 0.62 special requests per booking. The median and the 25th percentile values are 0, while the 75th percentile is 1, suggesting that most guests do not make special requests, but a significant portion do make at least one request. The number of special requests ranges from 0 to 5, with a standard deviation of 0.79.
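Claims like "the majority of bookings have 2 adults" or "most bookings do not include children" can be quantified directly as proportions. A minimal sketch on toy rows standing in for the real bookings (the toy values are illustrative, not taken from the file):

```python
import pandas as pd

# Toy rows standing in for the real bookings (values illustrative only)
toy = pd.DataFrame({"no_of_adults": [2, 2, 1, 2], "no_of_children": [0, 0, 2, 0]})

# Share of bookings with exactly 2 adults and share that include children
share_two_adults = (toy["no_of_adults"] == 2).mean()
share_with_children = (toy["no_of_children"] > 0).mean()
print(share_two_adults, share_with_children)  # 0.75 0.25
```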
df.describe(include = ['object']).T
| count | unique | top | freq | |
|---|---|---|---|---|
| Booking_ID | 36275 | 36275 | INN00001 | 1 |
| type_of_meal_plan | 36275 | 4 | Meal Plan 1 | 27835 |
| room_type_reserved | 36275 | 7 | Room_Type 1 | 28130 |
| market_segment_type | 36275 | 5 | Online | 23214 |
| booking_status | 36275 | 2 | Not_Canceled | 24390 |
Booking_ID:
There are 36,275 unique booking IDs, indicating the total number of bookings in the dataset.
Type of Meal Plan:
There are 4 different meal plans available to guests. The most common meal plan is "Meal Plan 1," which was selected 27,835 times out of 36,275 bookings. This suggests that "Meal Plan 1" is the most popular choice among guests.
Room Type Reserved:
There are 7 different types of rooms available. The most commonly reserved room type is "Room_Type 1," with 28,130 reservations. This indicates that "Room_Type 1" is the preferred choice for most guests.
Market Segment Type:
There are 5 different market segments from which bookings originate. The "Online" market segment is the largest, accounting for 23,214 out of 36,275 bookings. This suggests that a significant portion of the hotel's bookings come from online sources.
Booking Status:
There are 2 distinct booking statuses: "Not_Canceled" and "Canceled." The majority of bookings, 24,390 out of 36,275, have a status of "Not_Canceled." This indicates that most bookings are successfully completed and not canceled.
Total Bookings:
The dataset contains 36,275 unique bookings, indicating a substantial volume of reservations managed by the hotel.
Guest Preferences:
Meal Plans:
"Meal Plan 1" is the most preferred, chosen by 76.8% (27,835 out of 36,275) of the guests. This suggests that this meal plan is likely well-suited to the guests' needs or offers good value.
Room Types:
"Room_Type 1" is the most popular room type, reserved 77.5% (28,130 out of 36,275) of the time. This indicates that this room type meets the needs and preferences of a majority of the guests.
Booking Sources:
Market Segments:
The majority of bookings (64%) come from the "Online" market segment (23,214 out of 36,275). This suggests that the hotel's online presence and booking system are effective and widely used by guests.
Booking Outcomes:
Booking Status:
Most bookings (67%) are "Not_Canceled" (24,390 out of 36,275), indicating a relatively high rate of completed stays. This reflects positively on the hotel's ability to retain bookings and minimize cancellations.
Booking Patterns:
Guest Composition: The average booking consists of approximately 2 adults and rarely includes children, indicating a clientele primarily composed of couples or solo travelers.
Duration of Stay: The typical booking includes around 1 weekend night and 2 weeknights, pointing to a trend of short stays, possibly for short vacations or business trips.
Lead Time: The average lead time of 85 days suggests that many guests plan their stays well in advance, though there is also a significant portion of last-minute bookings.
Special Requests and Parking:
Special Requests: On average, guests make about 0.62 special requests per booking, indicating that while many guests have no special requests, a notable portion do have specific needs.
Car Parking: Only 3% of guests require a car parking space, which could imply that most guests rely on public transportation or other means rather than driving their own vehicles.
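The percentages quoted above (76.8% for Meal Plan 1, 67% not canceled, and so on) come straight from normalized value counts. A minimal sketch on a toy booking_status column (the real split is 24,390 to 11,885):

```python
import pandas as pd

# Toy booking_status column with a 3:1 split, standing in for the real data
toy = pd.Series(["Not_Canceled"] * 3 + ["Canceled"], name="booking_status")

# normalize=True turns raw counts into shares
shares = toy.value_counts(normalize=True)
print(shares["Not_Canceled"])  # 0.75
```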
# Check for duplicates
duplicate_rows = df[df.duplicated()]
# Print the duplicate rows
print("Duplicate Rows:")
print(duplicate_rows)
# Optionally, you can also count the number of duplicates
num_duplicates = df.duplicated().sum()
print("Number of duplicates:", num_duplicates)
Duplicate Rows:
Empty DataFrame
Columns: [Booking_ID, no_of_adults, no_of_children, no_of_weekend_nights, no_of_week_nights, type_of_meal_plan, required_car_parking_space, room_type_reserved, lead_time, arrival_year, arrival_month, arrival_date, market_segment_type, repeated_guest, no_of_previous_cancellations, no_of_previous_bookings_not_canceled, avg_price_per_room, no_of_special_requests, booking_status]
Index: []
Number of duplicates: 0
Observation:
The number of duplicates is 0.
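One caveat: because Booking_ID is unique for every row, `df.duplicated()` on the full frame can never flag anything. A stricter check drops the ID column first, shown here on two toy bookings that are identical except for their IDs (toy data, not from the real file):

```python
import pandas as pd

# Two toy bookings identical except for their IDs
toy = pd.DataFrame({
    "Booking_ID": ["INN00001", "INN00002"],
    "lead_time": [224, 224],
    "avg_price_per_room": [65.0, 65.0],
})

# Full-row check: the unique ID hides the repetition
print(toy.duplicated().sum())  # 0

# Check on the feature columns only
print(toy.drop(columns="Booking_ID").duplicated().sum())  # 1
```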
EDA is an important part of any project involving data. It is important to investigate and understand the data before building a model with it. A few questions are listed below to help approach the analysis in the right manner and generate insights from the data; a thorough analysis should go beyond these questions alone.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# The dataset is already loaded in df from the earlier read_csv cell
# Display basic statistics for numeric variables
numeric_summary = df.describe()
print(numeric_summary)
# Display basic statistics for categorical variables
categorical_summary = df.describe(include=['object'])
print(categorical_summary)
# Define numeric columns to plot
numeric_cols = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'lead_time',
'arrival_year', 'arrival_month', 'arrival_date', 'no_of_previous_cancellations',
'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 'no_of_special_requests']
# Histograms for numeric columns
plt.figure(figsize=(20, 15))
for i, col in enumerate(numeric_cols, 1):
    plt.subplot(4, 3, i)
    sns.histplot(df[col], kde=True)
    plt.title(f'Histogram of {col}')
plt.tight_layout()
plt.show()
# Box plots for numeric columns
plt.figure(figsize=(20, 15))
for i, col in enumerate(numeric_cols, 1):
    plt.subplot(4, 3, i)
    sns.boxplot(y=df[col])
    plt.title(f'Box plot of {col}')
plt.tight_layout()
plt.show()
# Define categorical columns to plot
categorical_cols = ['type_of_meal_plan', 'room_type_reserved', 'market_segment_type', 'booking_status']
# Bar plots for categorical variables
plt.figure(figsize=(15, 10))
for i, col in enumerate(categorical_cols, 1):
    plt.subplot(2, 2, i)
    df[col].value_counts().plot(kind='bar')
    plt.title(f'Bar plot of {col}')
plt.tight_layout()
plt.show()
no_of_adults no_of_children no_of_weekend_nights no_of_week_nights \
count 36275.000000 36275.000000 36275.000000 36275.000000
mean 1.844962 0.105279 0.810724 2.204300
std 0.518715 0.402648 0.870644 1.410905
min 0.000000 0.000000 0.000000 0.000000
25% 2.000000 0.000000 0.000000 1.000000
50% 2.000000 0.000000 1.000000 2.000000
75% 2.000000 0.000000 2.000000 3.000000
max 4.000000 10.000000 7.000000 17.000000
required_car_parking_space lead_time arrival_year arrival_month \
count 36275.000000 36275.000000 36275.000000 36275.000000
mean 0.030986 85.232557 2017.820427 7.423653
std 0.173281 85.930817 0.383836 3.069894
min 0.000000 0.000000 2017.000000 1.000000
25% 0.000000 17.000000 2018.000000 5.000000
50% 0.000000 57.000000 2018.000000 8.000000
75% 0.000000 126.000000 2018.000000 10.000000
max 1.000000 443.000000 2018.000000 12.000000
arrival_date repeated_guest no_of_previous_cancellations \
count 36275.000000 36275.000000 36275.000000
mean 15.596995 0.025637 0.023349
std 8.740447 0.158053 0.368331
min 1.000000 0.000000 0.000000
25% 8.000000 0.000000 0.000000
50% 16.000000 0.000000 0.000000
75% 23.000000 0.000000 0.000000
max 31.000000 1.000000 13.000000
no_of_previous_bookings_not_canceled avg_price_per_room \
count 36275.000000 36275.000000
mean 0.153411 103.423539
std 1.754171 35.089424
min 0.000000 0.000000
25% 0.000000 80.300000
50% 0.000000 99.450000
75% 0.000000 120.000000
max 58.000000 540.000000
no_of_special_requests
count 36275.000000
mean 0.619655
std 0.786236
min 0.000000
25% 0.000000
50% 0.000000
75% 1.000000
max 5.000000
Booking_ID type_of_meal_plan room_type_reserved market_segment_type \
count 36275 36275 36275 36275
unique 36275 4 7 5
top INN00001 Meal Plan 1 Room_Type 1 Online
freq 1 27835 28130 23214
booking_status
count 36275
unique 2
top Not_Canceled
freq 24390
Numeric Variables
Histograms:
Number of Adults (no_of_adults):
Most bookings are for two adults, with a noticeable peak at 2. Few bookings include 3 or 4 adults.
Number of Children (no_of_children):
The majority of bookings have no children. There's a small number of bookings with 1 or 2 children, and very few with more than that.
Number of Weekend Nights (no_of_weekend_nights):
Many bookings include 1 or 2 weekend nights. There's a smaller number of bookings with no weekend nights, indicating stays that span weekdays only.
Number of Week Nights (no_of_week_nights):
The distribution shows that many bookings are for 2 or 3 weeknights. Fewer bookings extend beyond 3 weeknights.
Lead Time (lead_time):
The lead time distribution is right-skewed: most bookings are made relatively close to the arrival date, with a long tail of bookings made far in advance.
Arrival Year (arrival_year):
Arrivals are concentrated in 2018, with a smaller share in 2017.
Arrival Month (arrival_month):
Bookings are spread across the months, with peaks in late summer and autumn, likely reflecting popular travel seasons.
Arrival Date (arrival_date):
The distribution is uniform, reflecting that bookings occur consistently throughout the month.
Number of Previous Cancellations (no_of_previous_cancellations):
Most guests have no previous cancellations, with a few guests having 1 or more.
Number of Previous Bookings Not Canceled (no_of_previous_bookings_not_canceled):
Most guests have no previous bookings that were not canceled. There are a few guests with multiple successful bookings.
Average Price Per Room (avg_price_per_room):
The distribution is right-skewed, indicating that while many rooms are priced around the mean, there are some higher-priced rooms.
Number of Special Requests (no_of_special_requests):
Most bookings have no special requests, with a smaller number making 1 or more requests.
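The "right-skewed" readings above can be checked numerically with pandas' `skew()`. A minimal sketch on toy lead-time values mimicking the long right tail (toy data, not the real column):

```python
import pandas as pd

# Toy lead_time values with a long right tail, mimicking the real shape
toy = pd.DataFrame({"lead_time": [1, 5, 17, 57, 85, 126, 443]})

# A positive skew statistic confirms a right-skewed distribution
skew = toy["lead_time"].skew()
print(skew > 0)  # True
```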
Box Plots:
Number of Adults:
The box plot confirms that the median number of adults per booking is 2, with few outliers.
Number of Children:
Most bookings have no children, as indicated by the median and a low number of outliers.
Number of Weekend Nights:
The majority of bookings span 1 or 2 weekend nights, with few outliers.
Number of Week Nights:
The median number of weeknights is around 2, with a reasonable spread up to about 5 weeknights and a few outliers.
Lead Time:
The median lead time is significantly less than the mean, indicating some very high lead times (outliers).
Arrival Year:
No significant outliers; the data is mostly for the year 2018.
Arrival Month:
Even distribution with no significant outliers.
Arrival Date:
Consistent distribution across all dates with no significant outliers.
Number of Previous Cancellations:
Very few previous cancellations per booking, with some outliers indicating guests with multiple cancellations.
Number of Previous Bookings Not Canceled:
Most values are clustered around 0, with some significant outliers indicating high numbers of previous successful bookings.
Average Price Per Room:
A wide range of room prices with some outliers at higher price points.
Number of Special Requests:
Most bookings have no special requests, but there are some bookings with multiple requests.
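The outliers visible in the box plots follow seaborn's default 1.5×IQR whisker rule, which can be reproduced directly. A minimal sketch on toy room prices including one extreme value (toy data, not the real column):

```python
import pandas as pd

# Toy avg_price_per_room values including one extreme price
prices = pd.Series([80.3, 99.45, 120.0, 100.0, 540.0])

# Flag values beyond 1.5 * IQR from the quartiles
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = prices[(prices < q1 - 1.5 * iqr) | (prices > q3 + 1.5 * iqr)]
print(len(outliers))  # 1  (the 540.0 booking)
```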
Categorical Variables
Bar Plots:
Type of Meal Plan (type_of_meal_plan):
"Meal Plan 1" is the most popular, followed by "Not Selected". The other meal plans are less frequently chosen.
Room Type Reserved (room_type_reserved):
"Room_Type 1" is the most reserved, followed by other room types in descending order of popularity.
Market Segment Type (market_segment_type):
The "Online" segment is the most significant, indicating that most bookings come from online channels. Other segments like "Offline" and "Corporate" are less common.
Booking Status (booking_status):
The majority of bookings are "Not Canceled," indicating a high completion rate. A smaller proportion of bookings are canceled.
Conclusion
The EDA using histograms, box plots, and bar plots provides a comprehensive view of the hotel bookings dataset:
Booking Demographics: Most bookings are for two adults without children, indicating a preference for couples or solo travelers.
Booking Patterns: Stays generally span 1-2 weekend nights and 2-3 weeknights, with bookings often made well in advance.
Special Requests: Most guests do not have special requests, but there is a significant minority who do.
Pricing: Room prices vary widely, with a significant number of higher-priced bookings.
Market Segments: The majority of bookings come from the online market segment, and "Meal Plan 1" is the most popular choice.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# The dataset is already loaded in df from the earlier read_csv cell
# Define numeric and categorical columns
numeric_cols = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'lead_time',
'arrival_year', 'arrival_month', 'arrival_date', 'no_of_previous_cancellations',
'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 'no_of_special_requests']
categorical_cols = ['type_of_meal_plan', 'room_type_reserved', 'market_segment_type', 'booking_status']
# Plotting correlation matrix for numeric variables
plt.figure(figsize=(12, 8))
correlation_matrix = df[numeric_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix of Numeric Variables')
plt.show()
# Pairplots for numeric variables
sns.pairplot(df[numeric_cols])
plt.suptitle('Pairplots of Numeric Variables', y=1.02)
plt.show()
# Box plots of numeric variables against categorical variables
for cat_col in categorical_cols:
    for num_col in numeric_cols:
        plt.figure(figsize=(10, 6))
        sns.boxplot(x=df[cat_col], y=df[num_col])
        plt.title(f'Box plot of {num_col} by {cat_col}')
        plt.xticks(rotation=90)
        plt.show()
# Bar plots of categorical variables against each other
for i, cat_col1 in enumerate(categorical_cols):
    for j, cat_col2 in enumerate(categorical_cols):
        if i < j:
            plt.figure(figsize=(10, 6))
            sns.countplot(x=df[cat_col1], hue=df[cat_col2])
            plt.title(f'Count plot of {cat_col1} by {cat_col2}')
            plt.xticks(rotation=90)
            plt.show()
Booking Patterns:
Guests tend to book rooms well in advance for longer stays, and these bookings often involve higher room prices. This could suggest a trend where planned vacations or business trips are booked early to secure availability and better prices.
Room Preferences:
The majority of guests prefer certain room types (e.g., Room_Type 1), and these room types often command higher prices. Understanding these preferences can help in room allocation and dynamic pricing strategies.
Meal Plans and Pricing:
The popularity of Meal Plan 1 and its potential association with higher room prices suggests that bundling meal plans with room bookings might be a successful strategy for increasing revenue.
Market Segments:
The dominance of the "Online" market segment indicates the importance of online marketing and booking platforms. Tailoring marketing efforts to attract more bookings from other segments could help diversify the guest profile.
Cancellation Trends:
Bookings with longer lead times and special requests might have different cancellation rates. Addressing issues related to special requests and ensuring guest satisfaction could help reduce cancellations.
Special Requests:
Higher-priced rooms tend to have more special requests, indicating that guests paying premium prices might expect more personalized services. Enhancing service quality for these bookings could improve guest satisfaction and loyalty.
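The hypothesis that long-lead-time bookings cancel more often can be tested with a groupby over lead-time buckets. A minimal sketch on toy bookings (the toy rows and the bucket edges are illustrative assumptions, not results from the real data):

```python
import pandas as pd

# Toy bookings: lead time in days and whether the booking was canceled
toy = pd.DataFrame({
    "lead_time": [5, 20, 150, 300, 10, 250],
    "booking_status": ["Not_Canceled", "Not_Canceled", "Canceled",
                       "Canceled", "Not_Canceled", "Canceled"],
})
toy["is_canceled"] = (toy["booking_status"] == "Canceled").astype(int)

# Cancellation rate per lead-time bucket
buckets = pd.cut(toy["lead_time"], bins=[0, 30, 180, 450],
                 labels=["short", "medium", "long"])
rates = toy.groupby(buckets, observed=True)["is_canceled"].mean()
print(rates)
```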
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# The dataset is already loaded in df from the earlier read_csv cell
# Calculate the number of bookings for each month
monthly_bookings = df['arrival_month'].value_counts().sort_index()
# Prepare the data for plotting
monthly_bookings_df = monthly_bookings.reset_index()
monthly_bookings_df.columns = ['Month', 'Number of Bookings']
# Plot the number of bookings for each month
plt.figure(figsize=(10, 6))
sns.barplot(data=monthly_bookings_df, x='Month', y='Number of Bookings', palette='viridis', hue='Month', dodge=False)
plt.title('Number of Bookings per Month')
plt.xlabel('Month')
plt.ylabel('Number of Bookings')
plt.xticks(monthly_bookings_df['Month'] - 1, ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.legend([],[], frameon=False)
plt.show()
# Print the monthly bookings
print(monthly_bookings)
arrival_month
1     1014
2     1704
3     2358
4     2736
5     2598
6     3203
7     2920
8     3813
9     4611
10    5317
11    2980
12    3021
Name: count, dtype: int64
Busiest Month:
October is the busiest month, with the highest number of bookings (over 5300).
High Booking Months:
September (over 4600) and August (over 3800) also see heavy demand, and June and July show a significant number of bookings, indicating a busy summer season.
Moderate Booking Months:
April and May have moderate booking numbers, with April showing an increase compared to the beginning of the year. November and December also sit at moderate levels, likely due to holiday travel.
Least Busy Months:
January has the lowest number of bookings, followed by February and March, indicating a quieter start to the year.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# The booking data is assumed to be already loaded into a DataFrame named df
# Calculate the number of bookings for each market segment
market_segment_bookings = df['market_segment_type'].value_counts()
# Prepare the data for plotting
market_segment_df = market_segment_bookings.reset_index()
market_segment_df.columns = ['Market Segment', 'Number of Bookings']
# Plot the number of bookings for each market segment
plt.figure(figsize=(10, 6))
sns.barplot(data=market_segment_df, x='Market Segment', y='Number of Bookings', palette='viridis', hue='Market Segment', dodge=False)
plt.title('Number of Bookings by Market Segment')
plt.xlabel('Market Segment')
plt.ylabel('Number of Bookings')
plt.xticks(rotation=45)
plt.legend([],[], frameon=False)
plt.show()
# Print the market segment bookings
print(market_segment_bookings)
market_segment_type
Online           23214
Offline          10528
Corporate         2017
Complementary      391
Aviation           125
Name: count, dtype: int64
Based on the bar plot showing the number of bookings by market segment, we can make the following observations:
Dominant Market Segment:
The "Online" segment is the largest source of bookings, with over 20,000 bookings. This indicates that a significant majority of guests book their stays through online channels.
Secondary Market Segment:
The "Offline" segment is the second-largest source of bookings, with around 10,000 bookings. This suggests that a substantial number of guests still prefer traditional booking methods like phone calls or walk-ins.
Other Segments:
The "Corporate" segment has a noticeable but much smaller number of bookings compared to "Online" and "Offline". This implies that corporate clients form a smaller part of the hotel's clientele.
The "Complementary" and "Aviation" segments have very few bookings, indicating that these segments contribute minimally to the hotel's overall bookings.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# The booking data is assumed to be already loaded into a DataFrame named df
# Calculate the average room price for each market segment
average_price_per_segment = df.groupby('market_segment_type')['avg_price_per_room'].mean().reset_index()
# Plot the average room price for each market segment
plt.figure(figsize=(10, 6))
sns.barplot(data=average_price_per_segment, x='market_segment_type', y='avg_price_per_room', palette='viridis', hue='market_segment_type', dodge=False)
plt.title('Average Room Price by Market Segment')
plt.xlabel('Market Segment')
plt.ylabel('Average Room Price')
plt.xticks(rotation=45)
plt.legend([],[], frameon=False)
plt.show()
# Print the average room prices
print(average_price_per_segment)
  market_segment_type  avg_price_per_room
0            Aviation          100.704000
1       Complementary            3.141765
2           Corporate           82.911740
3             Offline           91.632679
4              Online          112.256855
Inference and Differences in Room Prices
Online Segment:
The "Online" segment has the highest average room price, at about $112. This could be due to dynamic pricing algorithms used by online booking platforms, which adjust prices based on demand. Since online bookings are also by far the most common, this segment drives a large share of room revenue.
Aviation Segment:
The "Aviation" segment is next, averaging just over $100. Rooms booked through aviation-related channels (e.g., for layover passengers or airline crew) may be priced higher due to the short-notice nature of these bookings and their specific requirements.
Offline Segment:
The "Offline" segment shows a moderate average room price of around $92, indicating that guests booking through traditional methods (e.g., phone or walk-ins) tend to pay somewhat less than online guests.
Corporate Segment:
The "Corporate" segment averages around $83, the lowest of the paid channels. Negotiated corporate rates and volume agreements plausibly explain this discount relative to online prices.
Complementary Segment:
The average room price for the "Complementary" segment is the lowest, close to zero. This is expected, as complementary bookings are typically free of charge, offered as part of loyalty programs, promotions, or compensation for service recovery.
import pandas as pd
# The booking data is assumed to be already loaded into a DataFrame named df
# Calculate the total number of bookings
total_bookings = df.shape[0]
# Calculate the number of canceled bookings
canceled_bookings = df[df['booking_status'] == 'Canceled'].shape[0]
# Calculate the percentage of canceled bookings
percentage_canceled = (canceled_bookings / total_bookings) * 100
# Print the result
print(f'Percentage of canceled bookings: {percentage_canceled:.2f}%')
Percentage of canceled bookings: 32.76%
32.76% of all bookings were canceled, roughly one in three.
import pandas as pd
# The booking data is assumed to be already loaded into a DataFrame named df
# Calculate the total number of repeating guests
total_repeating_guests = df[df['repeated_guest'] == 1].shape[0]
# Calculate the number of canceled bookings among repeating guests
canceled_repeating_guests = df[(df['repeated_guest'] == 1) & (df['booking_status'] == 'Canceled')].shape[0]
# Calculate the percentage of canceled bookings among repeating guests
percentage_canceled_repeating_guests = (canceled_repeating_guests / total_repeating_guests) * 100
# Print the result
print(f'Percentage of canceled bookings among repeating guests: {percentage_canceled_repeating_guests:.2f}%')
Percentage of canceled bookings among repeating guests: 1.72%
Observations:
Only 1.72% of bookings made by repeated guests were canceled, far below the overall cancellation rate of 32.76%. Repeat guests are therefore a much more reliable customer segment.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
# The booking data is assumed to be already loaded into a DataFrame named df
# Calculate the cancellation rate for each number of special requests
special_requests_cancellation = df.groupby('no_of_special_requests')['booking_status'].value_counts(normalize=True).unstack().fillna(0)
special_requests_cancellation['Cancellation_Rate'] = special_requests_cancellation['Canceled'] * 100
# Prepare the data for plotting
special_requests_cancellation_df = special_requests_cancellation.reset_index()
# Plot the cancellation rate by number of special requests
plt.figure(figsize=(10, 6))
sns.barplot(data=special_requests_cancellation_df, x='no_of_special_requests', y='Cancellation_Rate', hue='no_of_special_requests', dodge=False, palette='viridis')
plt.title('Cancellation Rate by Number of Special Requests')
plt.xlabel('Number of Special Requests')
plt.ylabel('Cancellation Rate (%)')
plt.legend([],[], frameon=False)
plt.show()
# Perform a Chi-Square test for independence
contingency_table = pd.crosstab(df['no_of_special_requests'], df['booking_status'])
chi2, p, dof, ex = chi2_contingency(contingency_table)
# Print the result of the Chi-Square test
print(f'Chi-Square Test: chi2={chi2}, p-value={p}')
# Interpretation based on p-value
if p < 0.05:
    print('The number of special requests has a significant effect on booking cancellation (p < 0.05).')
else:
    print('The number of special requests does not have a significant effect on booking cancellation (p >= 0.05).')
Chi-Square Test: chi2=2421.6187208019905, p-value=0.0
The number of special requests has a significant effect on booking cancellation (p < 0.05).
Observations
Cancellation Rates by Number of Special Requests:
The bar plot shows a clear trend where the cancellation rate decreases as the number of special requests increases.
Guests with zero special requests have the highest cancellation rate, exceeding 40%.
Guests with one special request have a cancellation rate of around 25%.
Guests with two special requests have a cancellation rate of around 15%.
The cancellation rates for guests with three or more special requests are significantly lower and approach zero for guests with four and five special requests.
Statistical Significance:
The Chi-Square test results indicate a chi2 value of 2421.62 and a p-value of 0.0. Since the p-value is less than 0.05, it confirms that the number of special requests has a significant effect on booking cancellations. This means the observed differences in cancellation rates are statistically significant and unlikely to be due to random chance.
Interpretation
Impact of Special Requests on Cancellations:
Guests with no special requests are more likely to cancel their bookings. This could be due to less commitment or fewer specific needs being met by the hotel. Guests with one or more special requests are less likely to cancel, possibly indicating a higher level of engagement and commitment to their bookings due to specific requirements that the hotel needs to fulfill.
Service Improvement:
The hotel might consider enhancing its process for managing special requests. Ensuring that guests feel confident their needs will be met can potentially reduce cancellation rates. Clear communication with guests about their special requests and how the hotel plans to accommodate them might further reduce the likelihood of cancellations.
Guest Experience:
Improving the overall guest experience, particularly for those with special requests, can lead to higher satisfaction and loyalty. Training staff to handle special requests effectively and ensuring all departments are aware of and prepared to meet these needs can enhance the guest experience.
Targeted Follow-Up:
For guests who do not make any special requests, the hotel could implement follow-up communications to increase engagement and reduce cancellations. This could include reminders about their booking, highlights of hotel amenities, and personalized offers.
Conclusion
The analysis reveals that special requests have a significant impact on booking cancellations. Guests with special requests are less likely to cancel their bookings compared to those without any special requests.
The hotel can leverage this insight to improve its handling of special requests, enhance guest satisfaction, and reduce overall cancellation rates. By focusing on meeting guests' specific needs and ensuring clear communication, the hotel can foster greater commitment and loyalty from its guests.
Data preprocessing is a crucial step in the data analysis and machine learning pipeline. It involves transforming raw data into a clean and usable format to ensure the quality and reliability of the data. This process helps in improving the performance of machine learning models by addressing issues such as missing values, noise, and inconsistencies.
Key Steps in Data Preprocessing
Data Cleaning:
Handling Missing Values: Identifying and dealing with missing data using techniques such as imputation, deletion, or filling with mean/median/mode values.
Removing Duplicates: Identifying and removing duplicate records to ensure data integrity.
Correcting Errors: Identifying and correcting data entry errors, outliers, and inconsistencies.
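The cleaning steps above can be sketched on a small toy frame. The column names below mirror the booking data, but the values are made up purely for illustration:

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the booking data (illustrative values only)
toy = pd.DataFrame({
    'lead_time': [30, np.nan, 120, 30],
    'avg_price_per_room': [95.0, 110.0, np.nan, 95.0],
    'type_of_meal_plan': ['Meal Plan 1', None, 'Meal Plan 1', 'Meal Plan 1'],
})

# Impute numeric columns with the median, categorical columns with the mode
num_cols = toy.select_dtypes(include=np.number).columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())
cat_cols = toy.select_dtypes(include='object').columns
toy[cat_cols] = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

# Remove exact duplicate rows to ensure data integrity
toy = toy.drop_duplicates()

print(toy.isnull().sum().sum())  # prints 0: no missing values remain
```

In the actual dataset no missing values were found (see the `isnull().sum()` check below), so only the duplicate and error checks apply.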
df.isnull().sum()
Booking_ID                              0
no_of_adults                            0
no_of_children                          0
no_of_weekend_nights                    0
no_of_week_nights                       0
type_of_meal_plan                       0
required_car_parking_space              0
room_type_reserved                      0
lead_time                               0
arrival_year                            0
arrival_month                           0
arrival_date                            0
market_segment_type                     0
repeated_guest                          0
no_of_previous_cancellations            0
no_of_previous_bookings_not_canceled    0
avg_price_per_room                      0
no_of_special_requests                  0
booking_status                          0
dtype: int64
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
# The booking data is assumed to be already loaded into a DataFrame named df
# Create a copy of the dataset
df1 = df.copy()
# Step 3: Encoding Categorical Variables
categorical_cols = ['type_of_meal_plan', 'room_type_reserved', 'market_segment_type']
encoder = OneHotEncoder(sparse_output=False, drop='first')  # sparse_output replaces the deprecated sparse argument
encoded_features = encoder.fit_transform(df1[categorical_cols])
encoded_feature_names = encoder.get_feature_names_out(categorical_cols)
# Convert encoded features to a DataFrame
encoded_df = pd.DataFrame(encoded_features, columns=encoded_feature_names)
# Concatenate encoded features with the original dataset
data_encoded = pd.concat([df1.drop(columns=categorical_cols), encoded_df], axis=1)
# Step 4: Creating Interaction Features
data_encoded['total_nights'] = data_encoded['no_of_weekend_nights'] + data_encoded['no_of_week_nights']
# Step 5: Scaling and Normalization
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(data_encoded)
# Convert scaled features to a DataFrame
data_scaled = pd.DataFrame(scaled_features, columns=data_encoded.columns)
# Display the processed DataFrame
data_scaled.head()
|   | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | arrival_date | ... | room_type_reserved_1 | room_type_reserved_2 | room_type_reserved_3 | room_type_reserved_4 | room_type_reserved_5 | room_type_reserved_6 | market_segment_type_1 | market_segment_type_2 | market_segment_type_3 | market_segment_type_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.50 | 0.0 | 0.142857 | 0.117647 | 0.0 | 0.820513 | 0.0 | 0.818182 | 0.033333 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.000028 | 0.50 | 0.0 | 0.285714 | 0.176471 | 0.0 | 0.018315 | 1.0 | 0.909091 | 0.166667 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.000055 | 0.25 | 0.0 | 0.285714 | 0.058824 | 0.0 | 0.003663 | 1.0 | 0.090909 | 0.900000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.000083 | 0.50 | 0.0 | 0.000000 | 0.117647 | 0.0 | 0.772894 | 1.0 | 0.363636 | 0.633333 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.000110 | 0.50 | 0.0 | 0.142857 | 0.058824 | 0.0 | 0.175824 | 1.0 | 0.272727 | 0.333333 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
5 rows × 30 columns
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
# The booking data is assumed to be already loaded into a DataFrame named df
# Create a copy of the dataset
df1 = df.copy()
# Step 1: Create New Features
# Example: Total nights stayed
df1['total_nights'] = df1['no_of_weekend_nights'] + df1['no_of_week_nights']
# Example: Average price per night
df1['total_nights'] = df1['total_nights'].replace(0, 1) # Avoid division by zero by replacing 0 nights with 1
df1['avg_price_per_night'] = df1['avg_price_per_room'] / df1['total_nights']
# Drop the original columns used to create new features if necessary
# df1.drop(columns=['no_of_weekend_nights', 'no_of_week_nights'], inplace=True)
# Step 2: Encode Categorical Variables
categorical_cols = ['type_of_meal_plan', 'room_type_reserved', 'market_segment_type']
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_features = encoder.fit_transform(df1[categorical_cols])
encoded_feature_names = encoder.get_feature_names_out(categorical_cols)
# Convert encoded features to a DataFrame
encoded_df = pd.DataFrame(encoded_features, columns=encoded_feature_names)
# Concatenate encoded features with the original dataset
data_encoded = pd.concat([df1.drop(columns=categorical_cols), encoded_df], axis=1)
# Step 3: Scaling and Normalization
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(data_encoded)
# Convert scaled features to a DataFrame
data_scaled = pd.DataFrame(scaled_features, columns=data_encoded.columns)
# Display the processed DataFrame
data_scaled.head()
|   | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | arrival_date | ... | room_type_reserved_1 | room_type_reserved_2 | room_type_reserved_3 | room_type_reserved_4 | room_type_reserved_5 | room_type_reserved_6 | market_segment_type_1 | market_segment_type_2 | market_segment_type_3 | market_segment_type_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.50 | 0.0 | 0.142857 | 0.117647 | 0.0 | 0.820513 | 0.0 | 0.818182 | 0.033333 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.000028 | 0.50 | 0.0 | 0.285714 | 0.176471 | 0.0 | 0.018315 | 1.0 | 0.909091 | 0.166667 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.000055 | 0.25 | 0.0 | 0.285714 | 0.058824 | 0.0 | 0.003663 | 1.0 | 0.090909 | 0.900000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.000083 | 0.50 | 0.0 | 0.000000 | 0.117647 | 0.0 | 0.772894 | 1.0 | 0.363636 | 0.633333 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.000110 | 0.50 | 0.0 | 0.142857 | 0.058824 | 0.0 | 0.175824 | 1.0 | 0.272727 | 0.333333 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
5 rows × 31 columns
What is an Outlier?
An outlier is a data point that is significantly different from the rest of the observations in a dataset. Outliers can be unusually high or low values that do not fit the general pattern of the data. These points can arise due to various reasons, such as variability in the data, measurement errors, data entry errors, or genuine anomalies.
Characteristics of Outliers
Extreme Values: Outliers are values that lie far away from the mean or median of the dataset.
Influence on Statistical Measures: Outliers can significantly affect statistical measures like mean, standard deviation, and correlation.
Visual Identification: Outliers can often be visually identified in graphical representations like scatter plots, box plots, and histograms.
Importance of Handling Outliers Impact on Analysis: Outliers can skew the results of statistical analyses and lead to incorrect conclusions.
Model Performance: In machine learning, outliers can negatively impact model performance by distorting parameter estimates and increasing prediction errors.
Data Quality: Handling outliers improves the overall quality and reliability of the data.
Types of Outliers
Univariate Outliers: Outliers that are extreme values in a single feature.
Multivariate Outliers: Outliers that are unusual combinations of multiple features.
Contextual Outliers: Outliers that are considered anomalous in a specific context or condition.
Methods for Detecting Outliers
1. Visual Methods: box plots and scatter plots
2. Statistical Methods: Z-score and IQR (interquartile range) rules
3. Model-Based Methods: Isolation Forest and Local Outlier Factor (LOF)
Here I have used box plots to detect the outliers.
Box Plot: Displays the distribution of data based on five summary statistics (minimum, first quartile, median, third quartile, maximum). Outliers are typically shown as points outside the whiskers.
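The IQR rule behind those whiskers can also be applied numerically. The sketch below flags values beyond 1.5 × IQR from the quartiles, using a small made-up series of lead times rather than the actual column:

```python
import pandas as pd

# Illustrative lead-time values with one extreme booking (not the real data)
lead_time = pd.Series([10, 20, 25, 30, 35, 40, 400])

# IQR rule: points beyond 1.5 * IQR from the quartiles are flagged,
# matching the whis=1.5 setting used for the box plots
q1, q3 = lead_time.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = lead_time[(lead_time < lower) | (lead_time > upper)]
print(outliers.tolist())  # [400]
```

The same fences can feed a capping step instead of removal, which is the treatment applied in the next cell.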
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
# The booking data is assumed to be already loaded into a DataFrame named df
# Create a copy of the dataset
df1 = df.copy()
# Step 1: Outlier Detection using Boxplot
numeric_columns = df1.select_dtypes(include=np.number).columns.tolist()
num_columns = len(numeric_columns)
# Determine the grid size
grid_size = int(np.ceil(np.sqrt(num_columns)))
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
    plt.subplot(grid_size, grid_size, i + 1)
    plt.boxplot(df1[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
# Step 2: Treating Outliers
# Define a function to cap outliers
def cap_outliers(df, column, upper_quantile=0.95):
    upper_limit = df[column].quantile(upper_quantile)
    df[column] = np.where(df[column] > upper_limit, upper_limit, df[column])
    return df
# Apply the function to relevant numeric columns
columns_to_cap = ['lead_time', 'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'avg_price_per_room', 'no_of_special_requests']
for column in columns_to_cap:
    df1 = cap_outliers(df1, column)
# Step 3: Create New Features
# Example: Total nights stayed
df1['total_nights'] = df1['no_of_weekend_nights'] + df1['no_of_week_nights']
# Example: Average price per night
df1['total_nights'] = df1['total_nights'].replace(0, 1) # Avoid division by zero by replacing 0 nights with 1
df1['avg_price_per_night'] = df1['avg_price_per_room'] / df1['total_nights']
# Step 4: Encode Categorical Variables
categorical_cols = ['type_of_meal_plan', 'room_type_reserved', 'market_segment_type']
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded_features = encoder.fit_transform(df1[categorical_cols])
encoded_feature_names = encoder.get_feature_names_out(categorical_cols)
# Convert encoded features to a DataFrame
encoded_df = pd.DataFrame(encoded_features, columns=encoded_feature_names)
# Concatenate encoded features with the original dataset
data_encoded = pd.concat([df1.drop(columns=categorical_cols), encoded_df], axis=1)
# Step 5: Scaling and Normalization
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(data_encoded)
# Convert scaled features to a DataFrame
data_scaled = pd.DataFrame(scaled_features, columns=data_encoded.columns)
# Display the processed DataFrame
data_scaled.head()
|   | Booking_ID | no_of_adults | no_of_children | no_of_weekend_nights | no_of_week_nights | required_car_parking_space | lead_time | arrival_year | arrival_month | arrival_date | ... | room_type_reserved_1 | room_type_reserved_2 | room_type_reserved_3 | room_type_reserved_4 | room_type_reserved_5 | room_type_reserved_6 | market_segment_type_1 | market_segment_type_2 | market_segment_type_3 | market_segment_type_4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.50 | 0.0 | 0.142857 | 0.117647 | 0.0 | 0.820513 | 0.0 | 0.818182 | 0.033333 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.000028 | 0.50 | 0.0 | 0.285714 | 0.176471 | 0.0 | 0.018315 | 1.0 | 0.909091 | 0.166667 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.000055 | 0.25 | 0.0 | 0.285714 | 0.058824 | 0.0 | 0.003663 | 1.0 | 0.090909 | 0.900000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.000083 | 0.50 | 0.0 | 0.000000 | 0.117647 | 0.0 | 0.772894 | 1.0 | 0.363636 | 0.633333 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.000110 | 0.50 | 0.0 | 0.142857 | 0.058824 | 0.0 | 0.175824 | 1.0 | 0.272727 | 0.333333 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
5 rows × 31 columns
Observations from Outlier Detection and Treatment
Based on the updated boxplot analysis, here are detailed observations:
Number of Adults (no_of_adults):
Outliers: There are a few outliers where the number of adults is either 0 or greater than 2.
Treatment: These outliers could indicate either errors or special cases (e.g., a single parent with children).
Number of Children (no_of_children):
Outliers: Significant outliers where the number of children reaches up to 10.
Treatment: These outliers might indicate large family bookings.
Number of Weekend Nights (no_of_weekend_nights):
Outliers: Few outliers with weekend nights reaching up to 6.
Treatment: These values are reasonable and likely represent extended weekend stays.
Number of Week Nights (no_of_week_nights):
Outliers: Many outliers where the number of week nights is greater than 5, extending up to 17.
Treatment: These values could indicate long-term stays.
Required Car Parking Space (required_car_parking_space):
Outliers: Few outliers with requests for car parking space.
Treatment: The values are binary (0 or 1) and indicate whether a car parking space is needed.
Lead Time (lead_time):
Outliers: Significant outliers with lead times extending to over 400 days.
Treatment: Outliers were capped at the 95th percentile to reduce the impact of extreme values.
Arrival Year (arrival_year):
Outliers: A few outliers at the lower end (2017).
Treatment: The values represent actual years, so no treatment is necessary.
Arrival Month (arrival_month):
Outliers: Minimal outliers.
Treatment: The values are within a reasonable range (1 to 12).
Arrival Date (arrival_date):
Outliers: Minimal outliers.
Treatment: The values are within a reasonable range (1 to 31).
Repeated Guest (repeated_guest):
Outliers: Few outliers indicating repeated guests.
Treatment: The values are binary (0 or 1) and indicate whether the guest is a repeated guest.
Number of Previous Cancellations (no_of_previous_cancellations):
Outliers: Few outliers with guests having up to 12 previous cancellations.
Treatment: Outliers were capped at the 95th percentile to reduce the impact of extreme values.
Number of Previous Bookings Not Canceled (no_of_previous_bookings_not_canceled):
Outliers: Significant outliers with previous bookings not canceled reaching up to 58.
Treatment: Outliers were capped at the 95th percentile to reduce the impact of extreme values.
Average Price Per Room (avg_price_per_room):
Outliers: Significant outliers with room prices reaching up to 540 euros.
Treatment: Outliers were capped at the 95th percentile to reduce the impact of extreme values.
Number of Special Requests (no_of_special_requests):
Outliers: Few outliers with special requests reaching up to 5.
Treatment: Outliers were capped at the 95th percentile to reduce the impact of extreme values.
Total Nights (total_nights):
Outliers: Many outliers with total nights extending up to 25.
Treatment: These values indicate extended stays.
Booking Status (booking_status):
Outliers: No outliers observed, as the values are binary (0 or 1).
Interpretation and Action
Handling Outliers:
Capping outliers helps reduce skewness in the data and prevents the model from being overly influenced by extreme values.
Further Analysis:
For features like no_of_children and lead_time, further investigation may reveal specific customer segments or booking behaviors that lead to these outliers.
Modeling Considerations:
Scaling: Standardizing or normalizing features with significant outliers can improve model performance.
Robust Algorithms: Using algorithms that are less sensitive to outliers or applying preprocessing techniques to mitigate their impact can enhance model robustness.
By addressing outliers, creating new features, and scaling the data, the dataset is now better prepared for analysis and modeling tasks. This process ensures that the model will be less affected by extreme values and will likely perform better on new data.
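To make the scaling point concrete, the sketch below (using made-up prices, not the booking data) contrasts min-max scaling with scikit-learn's RobustScaler, which centers on the median and scales by the IQR and is therefore far less sensitive to a single extreme value:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Four typical prices plus one extreme value (illustrative only)
prices = np.array([[80.0], [90.0], [100.0], [110.0], [540.0]])

minmax = MinMaxScaler().fit_transform(prices)
robust = RobustScaler().fit_transform(prices)  # (x - median) / IQR

# Min-max squeezes the typical prices into a narrow band near 0 because
# the single 540 value defines the top of the range; the robust scaling
# keeps the typical prices well spread around 0
print(minmax.ravel().round(3))  # [0.    0.022 0.043 0.065 1.   ]
print(robust.ravel().round(3))  # [-1.  -0.5  0.   0.5 22. ]
```

Capping at the 95th percentile before min-max scaling, as done above, mitigates the same problem from the other direction.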
It is a good idea to explore the data once again after manipulating it.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
# The manipulated dataset from the previous steps is assumed to be loaded into df
df1 = df.copy()
# Fill missing values for numerical columns with the mean
numerical_cols = df1.select_dtypes(include=['float64', 'int64']).columns
df1[numerical_cols] = df1[numerical_cols].fillna(df1[numerical_cols].mean())
# Fill missing values for categorical columns with the mode
categorical_cols = df1.select_dtypes(include=['object']).columns
df1[categorical_cols] = df1[categorical_cols].fillna(df1[categorical_cols].mode().iloc[0])
# Create dummy variables for categorical columns
df1 = pd.get_dummies(df1, columns=categorical_cols, drop_first=True)
# Scale the features
scaler = StandardScaler()
df1[numerical_cols] = scaler.fit_transform(df1[numerical_cols])
# Step 1: Summary Statistics
numerical_summary = df1.describe()
print("Numerical Summary:\n", numerical_summary)
# Step 2: Distribution Plots
plt.figure(figsize=(20, 15))
for i, col in enumerate(numerical_cols, 1):
    plt.subplot(5, 4, i)
    sns.histplot(df1[col], kde=True)
    plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.show()
# Step 3: Correlation Matrix
plt.figure(figsize=(15, 10))
correlation_matrix = df1[numerical_cols].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
# Step 4: Box Plots
plt.figure(figsize=(20, 15))
for i, col in enumerate(numerical_cols, 1):
    plt.subplot(5, 4, i)
    sns.boxplot(x=df1[col])
    plt.title(f'Box Plot of {col}')
plt.tight_layout()
plt.show()
# Step 5: Count Plots
categorical_cols = df1.select_dtypes(include=['bool', 'uint8']).columns  # get_dummies yields bool columns (uint8 in older pandas)
plt.figure(figsize=(20, 15))
for i, col in enumerate(categorical_cols, 1):
    plt.subplot(5, 4, i)
    sns.countplot(y=df1[col])
    plt.title(f'Count Plot of {col}')
plt.tight_layout()
plt.show()
Numerical Summary:
no_of_adults no_of_children no_of_weekend_nights no_of_week_nights \
count 3.627500e+04 3.627500e+04 3.627500e+04 3.627500e+04
mean 4.270112e-17 1.518044e-17 9.950536e-17 -1.165466e-16
std 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00
min -3.556844e+00 -2.614704e-01 -9.311902e-01 -1.562353e+00
25% 2.988926e-01 -2.614704e-01 -9.311902e-01 -8.535778e-01
50% 2.988926e-01 -2.614704e-01 2.174012e-01 -1.448030e-01
75% 2.988926e-01 -2.614704e-01 1.365993e+00 5.639718e-01
max 4.154629e+00 2.457446e+01 7.108950e+00 1.048682e+01
required_car_parking_space lead_time arrival_year arrival_month \
count 3.627500e+04 3.627500e+04 3.627500e+04 3.627500e+04
mean 3.917534e-17 6.463931e-17 -2.254506e-13 1.436266e-16
std 1.000014e+00 1.000014e+00 1.000014e+00 1.000014e+00
min -1.788193e-01 -9.918878e-01 -2.137469e+00 -2.092496e+00
25% -1.788193e-01 -7.940515e-01 4.678430e-01 -7.895014e-01
50% -1.788193e-01 -3.285544e-01 4.678430e-01 1.877443e-01
75% -1.788193e-01 4.744282e-01 4.678430e-01 8.392415e-01
max 5.592239e+00 4.163493e+00 4.678430e-01 1.490739e+00
arrival_date repeated_guest no_of_previous_cancellations \
count 3.627500e+04 3.627500e+04 3.627500e+04
mean -4.701041e-17 1.704127e-17 1.008765e-17
std 1.000014e+00 1.000014e+00 1.000014e+00
min -1.670074e+00 -1.622099e-01 -6.339327e-02
25% -8.691889e-01 -1.622099e-01 -6.339327e-02
50% 4.610867e-02 -1.622099e-01 -6.339327e-02
75% 8.469940e-01 -1.622099e-01 -6.339327e-02
max 1.762292e+00 6.164850e+00 3.523139e+01
no_of_previous_bookings_not_canceled avg_price_per_room \
count 3.627500e+04 3.627500e+04
mean -3.094852e-17 -7.051561e-17
std 1.000014e+00 1.000014e+00
min -8.745646e-02 -2.947468e+00
25% -8.745646e-02 -6.589979e-01
50% -8.745646e-02 -1.132419e-01
75% -8.745646e-02 4.724127e-01
max 3.297706e+01 1.244200e+01
no_of_special_requests
count 3.627500e+04
mean 1.664952e-17
std 1.000014e+00
min -7.881400e-01
25% -7.881400e-01
50% -7.881400e-01
75% 4.837605e-01
max 5.571362e+00
Based on the exploratory data analysis (EDA) performed on the hotel booking dataset, here are the general observations:
Most features have a right-skewed distribution, particularly lead_time, avg_price_per_room, and no_of_previous_bookings_not_canceled. Uniform distributions are observed in features like arrival_month and arrival_date, indicating bookings are relatively evenly spread across months and days of the month.
Significant outliers are present in several features, including no_of_children, lead_time, no_of_previous_cancellations, no_of_previous_bookings_not_canceled, and avg_price_per_room.
Outliers indicate some exceptional cases, such as bookings with a very high number of children or extremely long lead times, which may need to be addressed during data preprocessing.
The majority of bookings are for 1 or 2 adults, with most having no children. Most bookings are for 0 to 2 weekend nights and 1 to 3 week nights, indicating typical short stays.
Most bookings have 0 to 1 special requests, suggesting that guests typically do not have many additional requirements.
The majority of guests do not require a car parking space, which may indicate a higher proportion of local or public transport users.
Repeated guests are a small proportion of the overall bookings but tend to have more previous bookings not canceled and fewer cancellations. Guests with higher room prices tend to make more special requests, indicating a correlation between room price and guest expectations.
The lead time for bookings is generally short, with most bookings made within 0 to 100 days before arrival. However, there are significant outliers with lead times extending up to 400 days.
A strong positive correlation is observed between repeated_guest and no_of_previous_bookings_not_canceled, indicating that loyal customers are likely to book more frequently without canceling. Moderate correlations exist between avg_price_per_room and the numbers of adults and children, suggesting that larger groups tend to book more expensive rooms.
Implications for Modeling
Handling Outliers: Outliers should be carefully examined and treated where necessary to improve model performance and prevent skewed results.
Feature Engineering: Creating additional features, such as total nights stayed or interaction terms, can capture more information and improve model predictions.
Scaling and Encoding: Proper scaling of numeric features and encoding of categorical features is essential for many machine learning algorithms.
Customer Segmentation: Insights from the EDA can be used for customer segmentation, allowing for targeted marketing and personalized offers.
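As a minimal sketch of the feature-engineering suggestion above, a total-nights feature can be derived from the two stay-length columns in the data dictionary (`toy` is a small stand-in for the hotel DataFrame):

```python
import pandas as pd

# Toy stand-in for the hotel DataFrame, using the two stay-length
# columns named in the data dictionary
toy = pd.DataFrame({
    "no_of_weekend_nights": [0, 2, 1],
    "no_of_week_nights": [3, 1, 0],
})

# Derived feature: total nights booked (weekend nights + week nights)
toy["total_nights"] = toy["no_of_weekend_nights"] + toy["no_of_week_nights"]
print(toy["total_nights"].tolist())  # [3, 3, 1]
```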
To check for multicollinearity in the data, we can use several methods, including:
Correlation Matrix: Examine the correlation coefficients between numeric variables.
Variance Inflation Factor (VIF): Calculate the VIF for each feature to quantify how much the variance is inflated due to multicollinearity.
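The VIF calculation is shown next; the correlation-matrix check mentioned first can be sketched as below on synthetic data (standing in for the hotel features; the 0.8 cutoff is an illustrative assumption, not a fixed rule):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: x2 is (almost) a linear copy of x1, x3 is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "x1": a,
    "x2": 2 * a + rng.normal(scale=0.1, size=200),
    "x3": rng.normal(size=200),
})

# Keep only the upper triangle of |correlations|, then flag pairs above 0.8
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_pairs = [(r, c) for r in upper.index for c in upper.columns if upper.loc[r, c] > 0.8]
print(high_pairs)  # [('x1', 'x2')]
```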
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Assuming df is the preprocessed hotel DataFrame, already loaded in memory
df1 = df.copy()
# Select numeric features for VIF calculation
numeric_cols = df1.select_dtypes(include=['float64', 'int64']).columns
# Calculate VIF for each numeric feature
vif_data = pd.DataFrame()
vif_data['Feature'] = numeric_cols
vif_data['VIF'] = [variance_inflation_factor(df1[numeric_cols].values, i) for i in range(len(numeric_cols))]
print("Variance Inflation Factors (VIF):\n", vif_data)
Variance Inflation Factors (VIF):
Feature VIF
0 no_of_adults 16.448306
1 no_of_children 1.242407
2 no_of_weekend_nights 1.959993
3 no_of_week_nights 3.678980
4 required_car_parking_space 1.062769
5 lead_time 2.174975
6 arrival_year 29.448847
7 arrival_month 7.158118
8 arrival_date 4.204407
9 repeated_guest 1.595823
10 no_of_previous_cancellations 1.337686
11 no_of_previous_bookings_not_canceled 1.603672
12 avg_price_per_room 12.751692
13 no_of_special_requests 1.797588
Observations
Some columns have very high VIF values, indicating the presence of strong multicollinearity:
No of Adults (no_of_adults): VIF = 16.45
Arrival Year (arrival_year): VIF = 29.45
Avg Price Per Room (avg_price_per_room): VIF = 12.75
Arrival Month (arrival_month): VIF = 7.16
These features have high VIF values indicating significant multicollinearity.
Arrival Date (arrival_date): VIF = 4.20
No of Week Nights (no_of_week_nights): VIF = 3.68
These features have moderate VIF values.
No of Children (no_of_children): VIF = 1.24
No of Weekend Nights (no_of_weekend_nights): VIF = 1.96
Required Car Parking Space (required_car_parking_space): VIF = 1.06
Lead Time (lead_time): VIF = 2.17
Repeated Guest (repeated_guest): VIF = 1.60
No of Previous Cancellations (no_of_previous_cancellations): VIF = 1.34
No of Previous Bookings Not Canceled (no_of_previous_bookings_not_canceled): VIF = 1.60
No of Special Requests (no_of_special_requests): VIF = 1.80
These features have low VIF values, indicating low multicollinearity.
We will systematically drop numerical columns with VIF > 5, ignoring the VIF values for dummy variables and the constant (intercept).
To remove multicollinearity:
Drop, one at a time, each column with a VIF score greater than 5, examine the adjusted R-squared and RMSE of the resulting models, and remove the variable whose omission changes adjusted R-squared the least.
Check the VIF scores again and continue until all VIF scores are under 5. Let's define a function that will help us do this.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
# Function to calculate VIF
def checking_vif(predictors):
    vif = pd.DataFrame()
    vif["feature"] = predictors.columns
    vif["VIF"] = [round(variance_inflation_factor(predictors.values, i), 2) for i in range(predictors.shape[1])]
    return vif
# Function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
# Function to compute MAPE (note: returns inf if any target value is zero)
def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100
# Function to compute different metrics to check the performance of a regression model
def model_performance_regression(model, predictors, target):
    pred = model.predict(predictors)
    r2 = r2_score(target, pred)
    adjr2 = adj_r2_score(predictors, target, pred)
    rmse = np.sqrt(mean_squared_error(target, pred))
    mae = mean_absolute_error(target, pred)
    mape = mape_score(target, pred)
    df_perf = pd.DataFrame(
        {
            "RMSE": [rmse],
            "MAE": [mae],
            "R-squared": [r2],
            "Adj. R-squared": [adjr2],
            "MAPE": [mape],
        }
    )
    return df_perf
# Function to iteratively remove multicollinearity
def remove_multicollinearity(data, target, features, threshold=5):
    X = data[features].fillna(data[features].mean())
    X = sm.add_constant(X)
    y = data[target]
    dropped_features = []
    while True:
        vif_df = checking_vif(X)
        print("VIF values:\n", vif_df)
        max_vif = vif_df['VIF'].max()
        if max_vif <= threshold:
            break
        feature_to_drop = vif_df.sort_values('VIF', ascending=False).iloc[0]['feature']
        # Note: if the intercept's VIF dominates, the loop stops here without
        # pruning anything; exclude 'const' from the max to keep pruning.
        if feature_to_drop == 'const':
            print("High VIF due to intercept, stopping removal.")
            break
        print(f"Dropping feature with highest VIF: {feature_to_drop}")
        X = X.drop(columns=[feature_to_drop])
        dropped_features.append(feature_to_drop)
    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Train the final OLS model
    final_model = sm.OLS(y_train, X_train).fit()
    # Evaluate model performance on the training set
    train_perf = model_performance_regression(final_model, X_train, y_train)
    print("Training Performance\n", train_perf)
    # Evaluate model performance on the test set
    test_perf = model_performance_regression(final_model, X_test, y_test)
    print("Test Performance\n", test_perf)
    print("Final VIF values:\n", checking_vif(X))
    return X, dropped_features, final_model
# Prepare the dataset (df is assumed to be loaded already)
df1 = df.copy()
# Define target and features
target = 'avg_price_per_room' # Assuming we want to predict the average price per room
features = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights', 'required_car_parking_space',
'lead_time', 'arrival_year', 'arrival_month', 'arrival_date', 'repeated_guest',
'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled', 'no_of_special_requests']
# Remove multicollinearity
final_X, dropped_features, final_model = remove_multicollinearity(df1, target, features)
print("Dropped features to remove multicollinearity:", dropped_features)
print("Final model summary:\n", final_model.summary())
VIF values:
feature VIF
0 const 33396745.53
1 no_of_adults 1.11
2 no_of_children 1.03
3 no_of_weekend_nights 1.05
4 no_of_week_nights 1.07
5 required_car_parking_space 1.03
6 lead_time 1.14
7 arrival_year 1.21
8 arrival_month 1.22
9 arrival_date 1.00
10 repeated_guest 1.54
11 no_of_previous_cancellations 1.33
12 no_of_previous_bookings_not_canceled 1.59
13 no_of_special_requests 1.11
High VIF due to intercept, stopping removal.
Training Performance
RMSE MAE R-squared Adj. R-squared MAPE
0 29.860042 22.50379 0.277818 0.277469 inf
Test Performance
RMSE MAE R-squared Adj. R-squared MAPE
0 29.965731 22.452745 0.262503 0.261077 inf
Final VIF values:
feature VIF
0 const 33396745.53
1 no_of_adults 1.11
2 no_of_children 1.03
3 no_of_weekend_nights 1.05
4 no_of_week_nights 1.07
5 required_car_parking_space 1.03
6 lead_time 1.14
7 arrival_year 1.21
8 arrival_month 1.22
9 arrival_date 1.00
10 repeated_guest 1.54
11 no_of_previous_cancellations 1.33
12 no_of_previous_bookings_not_canceled 1.59
13 no_of_special_requests 1.11
Dropped features to remove multicollinearity: []
Final model summary:
OLS Regression Results
==============================================================================
Dep. Variable: avg_price_per_room R-squared: 0.278
Model: OLS Adj. R-squared: 0.277
Method: Least Squares F-statistic: 858.3
Date: Thu, 11 Jul 2024 Prob (F-statistic): 0.00
Time: 22:31:18 Log-Likelihood: -1.3974e+05
No. Observations: 29020 AIC: 2.795e+05
Df Residuals: 29006 BIC: 2.796e+05
Df Model: 13
Covariance Type: nonrobust
========================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const -3.714e+04 1013.185 -36.660 0.000 -3.91e+04 -3.52e+04
no_of_adults 18.2885 0.356 51.392 0.000 17.591 18.986
no_of_children 27.6942 0.438 63.176 0.000 26.835 28.553
no_of_weekend_nights -2.3475 0.206 -11.372 0.000 -2.752 -1.943
no_of_week_nights -0.2406 0.129 -1.863 0.062 -0.494 0.013
required_car_parking_space 8.8787 1.012 8.773 0.000 6.895 10.862
lead_time -0.0525 0.002 -24.103 0.000 -0.057 -0.048
arrival_year 18.4382 0.502 36.724 0.000 17.454 19.422
arrival_month 1.5344 0.063 24.297 0.000 1.411 1.658
arrival_date 0.0186 0.020 0.923 0.356 -0.021 0.058
repeated_guest -27.3924 1.368 -20.027 0.000 -30.073 -24.712
no_of_previous_cancellations 1.2455 0.541 2.304 0.021 0.186 2.305
no_of_previous_bookings_not_canceled -0.6694 0.125 -5.371 0.000 -0.914 -0.425
no_of_special_requests 2.3632 0.235 10.046 0.000 1.902 2.824
==============================================================================
Omnibus: 2251.483 Durbin-Watson: 1.996
Prob(Omnibus): 0.000 Jarque-Bera (JB): 13283.815
Skew: 0.026 Prob(JB): 0.00
Kurtosis: 6.314 Cond. No. 1.17e+07
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.17e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
Observations from the Final Model
Key Metrics and Coefficients:
Standard Error (std err):
The standard errors of the coefficients are relatively low, indicating precise estimates.
P-value (P>|t|):
Most predictors have p-values less than 0.05, indicating they are statistically significant, except for arrival_date which has a p-value of 0.356.
Confidence Interval:
The confidence intervals for significant predictors do not include zero, which further supports their significance.
Adjusted R-squared:
The adjusted R-squared value is 0.277, suggesting that the model explains about 27.7% of the variance in the target variable (avg_price_per_room).
Multicollinearity:
The VIF values indicate no severe multicollinearity among the predictors, with all VIF values well below the threshold of 5, except for the intercept (const) which is extremely high due to numerical scaling issues.
The condition number (1.17e+07) is high, indicating potential multicollinearity or numerical problems.
Coefficient Interpretation:
Positive Predictors:
no_of_adults: Each additional adult increases the average price per room by approximately 18.29 euros.
no_of_children: Each additional child increases the average price per room by approximately 27.69 euros.
required_car_parking_space: Requiring a car parking space increases the average price per room by approximately 8.88 euros.
arrival_year: Each subsequent year increases the average price per room by approximately 18.44 euros.
arrival_month: Each one-unit increase in the arrival month number is associated with an increase of approximately 1.53 euros in the average price per room.
no_of_previous_cancellations: Each previous cancellation increases the average price per room by approximately 1.25 euros.
no_of_special_requests: Each special request increases the average price per room by approximately 2.36 euros.
Negative Predictors:
no_of_weekend_nights: Each additional weekend night decreases the average price per room by approximately 2.35 euros.
lead_time: Each additional day of lead time decreases the average price per room by approximately 0.0525 euros.
repeated_guest: Being a repeated guest decreases the average price per room by approximately 27.39 euros.
no_of_previous_bookings_not_canceled: Each additional previous booking not canceled decreases the average price per room by approximately 0.669 euros.
no_of_week_nights: Slightly negative (not significant at the 5% level with p-value = 0.062).
Model Performance:
Training Performance:
RMSE: 29.86, MAE: 22.50, R-squared: 0.277, Adjusted R-squared: 0.277, MAPE: inf (due to division by zero or very small target values)
Test Performance:
RMSE: 29.97, MAE: 22.45, R-squared: 0.263, Adjusted R-squared: 0.261, MAPE: inf (same reason as above)
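The infinite MAPE comes from zero values in the avg_price_per_room target. One possible fix, shown as a sketch (the `safe_mape` name is hypothetical), is to average only over nonzero targets:

```python
import numpy as np

def safe_mape(targets, predictions):
    """MAPE over nonzero targets only, avoiding the inf seen above.

    A sketch; sMAPE or plain MAE are common alternatives when zeros are frequent.
    """
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    mask = targets != 0
    return np.mean(np.abs(targets[mask] - predictions[mask]) / np.abs(targets[mask])) * 100

print(safe_mape([0.0, 100.0, 50.0], [10.0, 90.0, 55.0]))  # 10.0
```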
To build a Logistic Regression model using this dataset, the following steps are required:
Preprocess the Data: Handle missing values, encode categorical variables, and scale numerical features.
Split the Data: Divide the data into training and testing sets.
Train the Model: Fit a Logistic Regression model to the training data.
Evaluate the Model: Assess the model's performance using the testing data.
Let's start with the preprocessing steps.
Step 1: Preprocessing the Data We'll handle missing values, encode categorical variables using one-hot encoding, and scale numerical features.
Step 2: Splitting the Data We'll split the data into training and testing sets.
Step 3: Training the Model We'll fit a Logistic Regression model to the training data.
Step 4: Evaluating the Model We'll assess the model's performance using accuracy, precision, recall, and F1 score.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, roc_auc_score, mean_squared_error, mean_absolute_error, r2_score
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
# Assuming df is your preprocessed DataFrame
# Ensure target variable is of numeric type
df['booking_status'] = df['booking_status'].astype('category').cat.codes
# Define target and features
target = 'booking_status'
features = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights',
'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month',
'repeated_guest', 'no_of_previous_cancellations',
'no_of_previous_bookings_not_canceled', 'no_of_special_requests']
# Splitting the data into features and target
X = df[features]
y = df[target]
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Add constant to the model (intercept)
X_train_final = sm.add_constant(X_train)
X_test_final = sm.add_constant(X_test)
# Ensure all data is numeric
X_train_final = X_train_final.apply(pd.to_numeric)
X_test_final = X_test_final.apply(pd.to_numeric)
y_train = y_train.apply(pd.to_numeric)
y_test = y_test.apply(pd.to_numeric)
# Training the Logistic Regression model using statsmodels
logit_model_final = sm.Logit(y_train, X_train_final).fit()
print(logit_model_final.summary())
# Function to compute model performance metrics
def model_performance_classification(model, predictors, target):
    pred_prob = model.predict(predictors)
    pred = (pred_prob > 0.5).astype(int)
    accuracy = accuracy_score(target, pred)
    roc_auc = roc_auc_score(target, pred_prob)
    mse = mean_squared_error(target, pred)
    mae = mean_absolute_error(target, pred)
    r2 = r2_score(target, pred_prob)
    return pd.DataFrame({
        "Accuracy": [accuracy],
        "ROC-AUC": [roc_auc],
        "MSE": [mse],
        "MAE": [mae],
        "R-squared": [r2]
    })
Optimization terminated successfully.
Current function value: 0.478136
Iterations 12
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 29020
Model: Logit Df Residuals: 29007
Method: MLE Df Model: 12
Date: Thu, 11 Jul 2024 Pseudo R-squ.: 0.2440
Time: 22:45:15 Log-Likelihood: -13876.
converged: True LL-Null: -18355.
Covariance Type: nonrobust LLR p-value: 0.000
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const 2419.1975 98.980 24.441 0.000 2225.201 2613.194
no_of_adults -0.4394 0.030 -14.623 0.000 -0.498 -0.381
no_of_children -0.5134 0.036 -14.150 0.000 -0.585 -0.442
no_of_weekend_nights -0.1255 0.017 -7.366 0.000 -0.159 -0.092
no_of_week_nights -0.0346 0.011 -3.285 0.001 -0.055 -0.014
required_car_parking_space 1.1290 0.126 8.954 0.000 0.882 1.376
lead_time -0.0110 0.000 -56.330 0.000 -0.011 -0.011
arrival_year -1.1978 0.049 -24.421 0.000 -1.294 -1.102
arrival_month -0.0008 0.005 -0.148 0.882 -0.011 0.010
repeated_guest 2.4000 0.411 5.843 0.000 1.595 3.205
no_of_previous_cancellations -0.2293 0.074 -3.114 0.002 -0.374 -0.085
no_of_previous_bookings_not_canceled 0.1161 0.092 1.255 0.209 -0.065 0.297
no_of_special_requests 1.0426 0.024 42.866 0.000 0.995 1.090
========================================================================================================
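Logit coefficients are log-odds, so exponentiating them gives odds ratios that are easier to interpret. A short sketch using three coefficients copied from the summary above:

```python
import numpy as np
import pandas as pd

# Coefficients copied from the fitted Logit summary above
coefs = pd.Series({
    "required_car_parking_space": 1.1290,
    "lead_time": -0.0110,
    "no_of_special_requests": 1.0426,
})

# exp(coef) is the multiplicative change in the odds of class 1 per unit
# increase in the predictor (e.g. each extra day of lead time multiplies
# the odds by about 0.989)
odds_ratios = np.exp(coefs)
print(odds_ratios.round(3))
```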
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
# Define target and features
target = 'booking_status' # Assuming 'booking_status' is the target variable for classification
features = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights',
'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month',
'repeated_guest', 'no_of_previous_cancellations',
'no_of_previous_bookings_not_canceled', 'no_of_special_requests',
'avg_price_per_room']
# Split data into features and target
X = df[features]
y = df[target]
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Data preprocessing pipeline
numeric_features = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights',
'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month',
'repeated_guest', 'no_of_previous_cancellations',
'no_of_previous_bookings_not_canceled', 'no_of_special_requests',
'avg_price_per_room']
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='mean')),
('scaler', StandardScaler())
])
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features)
])
# Prepare the pipeline for logistic regression
from sklearn.linear_model import LogisticRegression
logreg_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
('classifier', LogisticRegression(random_state=42, max_iter=1000))])
# Fit the model
logreg_pipeline.fit(X_train, y_train)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('imputer', SimpleImputer()),
                                                                  ('scaler', StandardScaler())]),
                                                  ['no_of_adults', 'no_of_children',
                                                   'no_of_weekend_nights', 'no_of_week_nights',
                                                   'required_car_parking_space', 'lead_time',
                                                   'arrival_year', 'arrival_month',
                                                   'repeated_guest',
                                                   'no_of_previous_cancellations',
                                                   'no_of_previous_bookings_not_canceled',
                                                   'no_of_special_requests',
                                                   'avg_price_per_room'])])),
                ('classifier',
                 LogisticRegression(max_iter=1000, random_state=42))])
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, accuracy_score
# Predict on the test set
y_pred = logreg_pipeline.predict(X_test)
y_pred_prob = logreg_pipeline.predict_proba(X_test)[:, 1]
# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, y_pred_prob))
# Plot ROC Curve
import matplotlib.pyplot as plt
fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label='Logistic Regression (AUC = {:.2f})'.format(roc_auc_score(y_test, y_pred_prob)))
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
Accuracy: 0.7875947622329428
Classification Report:
precision recall f1-score support
0 0.74 0.56 0.64 2416
1 0.80 0.90 0.85 4839
accuracy 0.79 7255
macro avg 0.77 0.73 0.74 7255
weighted avg 0.78 0.79 0.78 7255
Confusion Matrix:
[[1361 1055]
[ 486 4353]]
ROC-AUC Score: 0.8377523217812228
The logistic regression model demonstrates reasonable performance, with an overall accuracy of 78.76% and an AUC of 0.84. With booking_status encoded so that class 1 is "Not_Canceled", the model identifies non-canceled bookings well (precision 0.80, recall 0.90) but misses many cancellations (recall 0.56 for class 0). Predictors such as the number of adults and children, lead time, and special requests have a significant impact on booking cancellations. There is clear room for improvement in correctly identifying canceled bookings, which are the costlier errors for the hotel.
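One cheap lever for rebalancing the two error types is the classification threshold: moving the 0.5 cutoff trades recall between the two classes. A sketch on synthetic imbalanced data (standing in for the fitted pipeline; the 0.65 cutoff is an arbitrary illustration, and in practice the threshold would be tuned on a validation split):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: class 1 is the majority, as in the booking data
X, y = make_classification(n_samples=2000, weights=[0.33, 0.67], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
prob = clf.predict_proba(X_te)[:, 1]

# Raising the cutoff for predicting class 1 can only grow the predicted-0 set,
# so class-0 recall is non-decreasing in the threshold
rec0_default = recall_score(y_te, (prob > 0.5).astype(int), pos_label=0)
rec0_tuned = recall_score(y_te, (prob > 0.65).astype(int), pos_label=0)
print(rec0_default, rec0_tuned)
```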
print("Columns in X_test:")
print(X_test.columns)
Columns in X_test:
Index(['no_of_adults', 'no_of_children', 'no_of_weekend_nights',
'no_of_week_nights', 'required_car_parking_space', 'lead_time',
'arrival_year', 'arrival_month', 'repeated_guest',
'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled',
'no_of_special_requests', 'avg_price_per_room'],
dtype='object')
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve, accuracy_score, confusion_matrix
import seaborn as sns
# Print the columns in X_test to identify the correct feature names
print("Columns in X_test:")
print(X_test.columns)
# Assuming you have identified features to drop (replace these with actual feature names if different)
features_to_drop = ['no_of_weekend_nights', 'no_of_week_nights',
'required_car_parking_space', 'lead_time',
'arrival_year', 'arrival_month', 'repeated_guest',
'no_of_previous_cancellations', 'no_of_special_requests']
# Drop the specified features from the test set
X_test1 = X_test.drop(features_to_drop, axis=1)
# Ensure you drop the same features from X_train
X_train1 = X_train.drop(features_to_drop, axis=1)
# Train the Logistic Regression model on the new training set
logit_model_final = sm.Logit(y_train, sm.add_constant(X_train1)).fit()
print(logit_model_final.summary())
# Compute ROC-AUC for the training set
logit_roc_auc_train = roc_auc_score(y_train, logit_model_final.predict(sm.add_constant(X_train1)))
fpr, tpr, thresholds = roc_curve(y_train, logit_model_final.predict(sm.add_constant(X_train1)))
plt.figure(figsize=(7, 5))
plt.plot(fpr, tpr, label="Logistic Regression (area = %0.2f)" % logit_roc_auc_train)
plt.plot([0, 1], [0, 1], "r--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic - Training Data")
plt.legend(loc="lower right")
plt.show()
# Predict on the modified test set (threshold predicted probabilities at 0.5)
pred_test = (logit_model_final.predict(sm.add_constant(X_test1)) > 0.5).astype(int)
# Evaluate accuracy
print("Accuracy on training set: ", accuracy_score(y_train, logit_model_final.predict(sm.add_constant(X_train1)) > 0.5))
print("Accuracy on test set: ", accuracy_score(y_test, pred_test))
# Plotting the confusion matrix for the test set
cm_test = confusion_matrix(y_test, pred_test)
plt.figure(figsize=(7, 5))
sns.heatmap(cm_test, annot=True, fmt="g")
plt.xlabel("Predicted Values")
plt.ylabel("Actual Values")
plt.title("Confusion Matrix - Test Data")
plt.show()
Columns in X_test:
Index(['no_of_adults', 'no_of_children', 'no_of_weekend_nights',
'no_of_week_nights', 'required_car_parking_space', 'lead_time',
'arrival_year', 'arrival_month', 'repeated_guest',
'no_of_previous_cancellations', 'no_of_previous_bookings_not_canceled',
'no_of_special_requests', 'avg_price_per_room'],
dtype='object')
Optimization terminated successfully.
Current function value: 0.615315
Iterations 11
Logit Regression Results
==============================================================================
Dep. Variable: booking_status No. Observations: 29020
Model: Logit Df Residuals: 29015
Method: MLE Df Model: 4
Date: Thu, 11 Jul 2024 Pseudo R-squ.: 0.02566
Time: 23:06:11 Log-Likelihood: -17856.
converged: True LL-Null: -18327.
Covariance Type: nonrobust LLR p-value: 2.598e-202
========================================================================================================
coef std err z P>|z| [0.025 0.975]
--------------------------------------------------------------------------------------------------------
const 1.8116 0.057 32.001 0.000 1.701 1.923
no_of_adults -0.1662 0.026 -6.313 0.000 -0.218 -0.115
no_of_children 0.0877 0.033 2.619 0.009 0.022 0.153
no_of_previous_bookings_not_canceled 1.1931 0.168 7.110 0.000 0.864 1.522
avg_price_per_room -0.0077 0.000 -18.478 0.000 -0.009 -0.007
========================================================================================================
Accuracy on training set: 0.6702274293590628
Accuracy on test set: 0.6638180565127498
Observations
Accuracy on training set: 0.6702274293590628
Accuracy on test set: 0.6638180565127498
Accuracy drops sharply from the full model's 0.79, indicating that the dropped features (notably lead_time and no_of_special_requests) carried substantial predictive signal.
Data Preparation: The data is prepared and split into training and testing sets.
Training the Decision Tree Model: A Decision Tree model is trained using the training data.
Model Accuracy: The code prints the accuracy of the model on both the training and testing sets.
Confusion Matrix: A confusion matrix is generated and visualized using a heatmap.
Recall Score: The recall score is calculated and printed for both the training and testing sets.
Feature Importance: The feature importance of the trained Decision Tree model is plotted.
Decision Tree Plot: The decision tree is visualized with arrows added to the splits.
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.model_selection import train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming df is your preprocessed DataFrame and booking_status is already numeric
# Define target and features
target = 'booking_status'
features = ['no_of_adults', 'no_of_children', 'required_car_parking_space', 'lead_time',
'arrival_month', 'repeated_guest', 'no_of_previous_cancellations',
'no_of_previous_bookings_not_canceled', 'no_of_special_requests']
# Prepare the data (the selected features already exclude the date and stay-length columns)
tree_data = df[features + [target]].astype(float)
# Split the data into features and target
X = tree_data.drop(target, axis=1)
y = tree_data[target]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
# Initialize and fit the Decision Tree model with pre-pruning
dTree = DecisionTreeClassifier(criterion='gini', random_state=1, max_depth=5, min_samples_split=20, min_samples_leaf=5)
dTree.fit(X_train, y_train)
# Print accuracy
print("Accuracy on training set: ", dTree.score(X_train, y_train))
print("Accuracy on test set: ", dTree.score(X_test, y_test))
# Function to plot the confusion matrix
# (with labels=[1, 0], the first row/column corresponds to class 1, i.e. Not_Canceled)
def make_confusion_matrix(model, y_actual, X_test, labels=[1, 0]):
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    df_cm = pd.DataFrame(cm,
                         index=["Actual - Not Canceled", "Actual - Canceled"],
                         columns=["Predicted - Not Canceled", "Predicted - Canceled"])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    annot_labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    annot_labels = np.asarray(annot_labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=annot_labels, fmt='', cmap='Blues')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.title('Confusion Matrix')
    plt.show()
# Function to calculate recall scores
def get_recall_score(model, X_train, y_train, X_test, y_test):
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    print("Recall on training set: ", metrics.recall_score(y_train, pred_train))
    print("Recall on test set: ", metrics.recall_score(y_test, pred_test))
# Generate confusion matrix for test set
make_confusion_matrix(dTree, y_test, X_test)
# Calculate recall score
get_recall_score(dTree, X_train, y_train, X_test, y_test)
# Plotting Feature Importance
feature_names = list(X_train.columns)
importances = dTree.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10, 6))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
# Plotting Decision Tree
plt.figure(figsize=(20, 10))
out = plot_tree(
dTree,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=['Not_Canceled', 'Canceled']
)
# Add arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree
tree_rules = export_text(dTree, feature_names=feature_names, show_weights=True)
print(tree_rules)
Accuracy on training set: 0.7756379962192816
Accuracy on test set: 0.7782780483322613
Recall on training set: 0.9325855892888602
Recall on test set: 0.9330254041570438
[Output of export_text: the full rule listing of the fitted tree, flattened in export. The root split is on lead_time (<= 151.50); subsequent splits use no_of_special_requests, arrival_month, no_of_adults, and required_car_parking_space. Bookings with short lead times or with special requests fall almost entirely into class 1, while long-lead-time bookings by two or more adults with no special requests are predominantly class 0.]
Observations
Pruning a decision tree can help to prevent overfitting, which occurs when the model captures noise in the training data rather than the underlying patterns. Pruning reduces the complexity of the tree, which can improve its generalization performance on new, unseen data.
Pre-pruning techniques are applied by setting parameters such as max_depth, min_samples_split, and min_samples_leaf.
These parameters help to limit the growth of the tree and avoid overfitting.
Current model performance:
Training and Testing Accuracy:
Training Accuracy: 77.56%
Testing Accuracy: 77.83%
Recall Scores:
Training Recall: 93.26%
Testing Recall: 93.30%
The training and test scores are nearly identical, which suggests the model is not overfitting: accuracy is moderate (about 78%) while recall is very high (about 93%), so the current level of pruning appears effective.
However, if we want to further explore the effect of different pruning parameters, we can experiment with additional values for max_depth, min_samples_split, and min_samples_leaf.
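To illustrate, one systematic way to explore those parameters is a small grid search optimized for recall, the metric emphasized above. The sketch below is a minimal, self-contained example: it uses synthetic stand-in data from `make_classification` rather than the notebook's `X_train`/`y_train`, and the parameter ranges are arbitrary starting points, not tuned recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; in the notebook this would be X_train / y_train
X, y = make_classification(n_samples=1000, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Candidate pre-pruning values (illustrative, not recommendations)
param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_split": [2, 10, 30],
    "min_samples_leaf": [1, 5, 10],
}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",  # optimize for the metric this analysis emphasizes
    cv=5,
    n_jobs=-1,
)
grid.fit(X_tr, y_tr)
print("Best parameters:", grid.best_params_)
print("Test recall:", grid.score(X_te, y_te))
```

Cross-validated search avoids picking pruning values that only look good on a single train/test split.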
dTree1 = DecisionTreeClassifier(criterion = 'gini',max_depth=3,random_state=1)
dTree1.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=3, random_state=1)
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming df is your preprocessed DataFrame and booking_status is already numeric
# Define target and features
target = 'booking_status'
features = ['no_of_adults', 'no_of_children', 'no_of_weekend_nights', 'no_of_week_nights',
'required_car_parking_space', 'lead_time', 'arrival_year', 'arrival_month',
'repeated_guest', 'no_of_previous_cancellations',
'no_of_previous_bookings_not_canceled', 'no_of_special_requests']
# Prepare the data
tree_data = df[features + [target]].astype(float)
# Drop irrelevant features
tree_data = tree_data.drop(['arrival_date', 'arrival_year', 'no_of_week_nights', 'no_of_weekend_nights'], axis=1, errors='ignore')
# Split the data into features and target
X = tree_data.drop(target, axis=1)
y = tree_data[target]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
# Initialize and fit the Decision Tree model with max depth of 3
dTree1 = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=1)
dTree1.fit(X_train, y_train)
# Function to plot a confusion matrix
def make_confusion_matrix_sklearn(model, X, y):
    y_pred = model.predict(X)
    cm = metrics.confusion_matrix(y, y_pred)
    sns.heatmap(cm, annot=True, fmt='g')
    plt.xlabel("Predicted Values")
    plt.ylabel("Actual Values")
    plt.title("Confusion Matrix")
    plt.show()
# Generate confusion matrix for the test set
make_confusion_matrix_sklearn(dTree1, X_test, y_test)
# Print accuracy
print("Accuracy on training set: ", dTree1.score(X_train, y_train))
print("Accuracy on test set: ", dTree1.score(X_test, y_test))
# Plotting Feature Importances
feature_names = list(X_train.columns)
importances = dTree1.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
# Plotting Decision Tree
plt.figure(figsize=(20, 10))
out = plot_tree(
    dTree1,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=['Not_Canceled', 'Canceled'],
)
# Add arrows to the decision tree splits if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
Accuracy on training set: 0.7667375551354757
Accuracy on test set: 0.7701920426353027
from sklearn.tree import DecisionTreeClassifier
# Fit the initial decision tree
clf = DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)
# Prune the tree using cost complexity pruning
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
# Plot the total impurity vs effective alpha
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("Effective Alpha")
ax.set_ylabel("Total Impurity of Leaves")
ax.set_title("Total Impurity vs Effective Alpha for Training Set")
plt.show()
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
# Fit the initial decision tree
clf = DecisionTreeClassifier(random_state=1)
clf.fit(X_train, y_train)
# Prune the tree using cost complexity pruning
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
# Filter out non-valid (negative) ccp_alpha values
valid_ccp_alphas = [alpha for alpha in ccp_alphas if alpha >= 0]
# Select a subset of ccp_alphas to speed up the process
subset_ccp_alphas = valid_ccp_alphas[::10] # Select every 10th value for example
# Decision Tree classifier for every valid alpha
clfs = []
for ccp_alpha in subset_ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
    clfs[-1].tree_.node_count, subset_ccp_alphas[-1]))
# Remove the last element which is the trivial tree with one node
clfs = clfs[:-1]
subset_ccp_alphas = subset_ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
# Plotting
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(subset_ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("Alpha")
ax[0].set_ylabel("Number of Nodes")
ax[0].set_title("Number of Nodes vs Alpha")
ax[1].plot(subset_ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("Alpha")
ax[1].set_ylabel("Depth of Tree")
ax[1].set_title("Depth vs Alpha")
fig.tight_layout()
plt.show()
Number of nodes in the last tree is: 5 with ccp_alpha: 0.010030306239306092
# Reuse clfs and subset_ccp_alphas fitted in the previous cell
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
# Plotting accuracy vs alpha
fig, ax = plt.subplots(figsize=(10, 5))
ax.set_xlabel("Alpha")
ax.set_ylabel("Accuracy")
ax.set_title("Accuracy vs Alpha for Training and Testing Sets")
ax.plot(subset_ccp_alphas, train_scores, marker='o', label="Train", drawstyle="steps-post")
ax.plot(subset_ccp_alphas, test_scores, marker='o', label="Test", drawstyle="steps-post")
ax.legend()
plt.show()
# Selecting the best model based on test scores
index_best_model = np.argmax(test_scores)
best_model = clfs[index_best_model]
print("Best Decision Tree Model:\n", best_model)
print('Training accuracy of best model: ', best_model.score(X_train, y_train))
print('Test accuracy of best model: ', best_model.score(X_test, y_test))
# Recall for training set
recall_train = []
for clf in clfs:
    pred_train3 = clf.predict(X_train)
    values_train = metrics.recall_score(y_train, pred_train3)
    recall_train.append(values_train)
# Recall for testing set
recall_test = []
for clf in clfs:
    pred_test3 = clf.predict(X_test)
    values_test = metrics.recall_score(y_test, pred_test3)
    recall_test.append(values_test)
# Plotting recall vs alpha
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(subset_ccp_alphas, recall_train, marker='o', label="train", drawstyle="steps-post")
ax.plot(subset_ccp_alphas, recall_test, marker='o', label="test", drawstyle="steps-post")
ax.legend()
plt.show()
Best Decision Tree Model:
DecisionTreeClassifier(ccp_alpha=8.751662815935032e-05, random_state=1)
Training accuracy of best model: 0.8482592942659105
Test accuracy of best model: 0.8278967196545071
# Assuming 'recall_test' and 'clfs' have been defined in previous cells
# Creating the model where we get the highest train and test recall
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print("Best Decision Tree Model:\n", best_model)
# Evaluating the best model
print('Training accuracy of best model: ', best_model.score(X_train, y_train))
print('Test accuracy of best model: ', best_model.score(X_test, y_test))
# Evaluating model performance on the training set
train_performance = model_performance_classification_sklearn(best_model, X_train, y_train)
print("Training Performance\n", train_performance)
# Evaluating model performance on the test set
test_performance = model_performance_classification_sklearn(best_model, X_test, y_test)
print("Test Performance\n", test_performance)
# Plotting feature importance for the best model
importances = best_model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Best Decision Tree Model:
DecisionTreeClassifier(ccp_alpha=0.001528941615543386, random_state=1)
Training accuracy of best model: 0.7747321991178324
Test accuracy of best model: 0.7764403197647708
Training Performance
Accuracy Recall ROC-AUC
0 0.774732 0.941335 0.818325
Test Performance
Accuracy Recall ROC-AUC
0 0.77644 0.939954 0.815768
# Reuse make_confusion_matrix_sklearn defined in an earlier cell
# Generate confusion matrix for the best model on the test set
make_confusion_matrix_sklearn(best_model, X_test, y_test)
the_features = X_train.columns
plt.figure(figsize=(17, 15))
plot_tree(
    best_model,
    feature_names=the_features,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=True,
)
plt.show()
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
# Function to compute performance metrics for a classification model
def compute_performance_metrics(model, X_train, y_train, X_test, y_test):
    metrics_dict = {}
    # Training set predictions
    y_train_pred = model.predict(X_train)
    y_train_prob = model.predict_proba(X_train)[:, 1]
    # Test set predictions
    y_test_pred = model.predict(X_test)
    y_test_prob = model.predict_proba(X_test)[:, 1]
    # Training set performance
    metrics_dict['Train Accuracy'] = accuracy_score(y_train, y_train_pred)
    metrics_dict['Train Precision'] = precision_score(y_train, y_train_pred)
    metrics_dict['Train Recall'] = recall_score(y_train, y_train_pred)
    metrics_dict['Train F1 Score'] = f1_score(y_train, y_train_pred)
    metrics_dict['Train ROC AUC'] = roc_auc_score(y_train, y_train_prob)
    # Test set performance
    metrics_dict['Test Accuracy'] = accuracy_score(y_test, y_test_pred)
    metrics_dict['Test Precision'] = precision_score(y_test, y_test_pred)
    metrics_dict['Test Recall'] = recall_score(y_test, y_test_pred)
    metrics_dict['Test F1 Score'] = f1_score(y_test, y_test_pred)
    metrics_dict['Test ROC AUC'] = roc_auc_score(y_test, y_test_prob)
    return metrics_dict
# Compute performance metrics for the best decision tree model
decision_tree_metrics = compute_performance_metrics(best_model, X_train, y_train, X_test, y_test)
# Print performance metrics
decision_tree_metrics_df = pd.DataFrame.from_dict(decision_tree_metrics, orient='index', columns=['Decision Tree'])
print(decision_tree_metrics_df)
# You can add other models' metrics to this DataFrame for comparison
# For example, adding logistic regression model's metrics
# logistic_regression_metrics = compute_performance_metrics(logistic_regression_model, X_train, y_train, X_test, y_test)
# decision_tree_metrics_df['Logistic Regression'] = pd.Series(logistic_regression_metrics)
# Print comparison table
print(decision_tree_metrics_df)
Decision Tree
Train Accuracy 0.774732
Train Precision 0.772493
Train Recall 0.941335
Train F1 Score 0.848597
Train ROC AUC 0.818325
Test Accuracy 0.776440
Test Precision 0.776543
Test Recall 0.939954
Test F1 Score 0.850470
Test ROC AUC 0.815768
Observations
Accuracy: The model achieves similar accuracy on the training set (0.775) and the test set (0.776), correctly classifying roughly three-quarters of bookings and generalizing well to unseen data.

Precision: Precision is comparable on the training set (0.772) and the test set (0.777), meaning about three out of four positive predictions are correct.

Recall: Recall is very high on both the training set (0.941) and the test set (0.940), so the model identifies nearly all true positive cases; this is valuable when missing a positive case is costly.

F1 Score: The F1 score, which balances precision and recall, is 0.849 on the training set and 0.850 on the test set, indicating a solid trade-off between the two metrics.

ROC AUC: The ROC AUC is 0.818 on the training set and 0.816 on the test set, showing a reasonable ability to rank positive cases above negative ones.

Generalization: The training and test metrics are very close, suggesting the pruned tree generalizes well and is not overfitting.
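Since the model's purpose is to flag bookings likely to be canceled in advance, a minimal sketch of how it could be applied is shown below. It rebuilds a small stand-in tree on synthetic data (the notebook would use `best_model` and `X_test`), assumes class 1 encodes "canceled", and uses the default 0.5 probability threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the notebook's data and fitted best_model
X, y = make_classification(n_samples=500, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
model = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_tr, y_tr)

# Probability of class 1 (assumed here to be "canceled")
cancel_prob = model.predict_proba(X_te)[:, 1]

# Flag bookings above the operating threshold; 0.5 is the default decision
# boundary, and lowering it catches more cancellations at the cost of precision
threshold = 0.5
flagged = cancel_prob >= threshold
print("Bookings flagged as likely cancellations:", int(flagged.sum()))
```

The threshold is a business lever: the hotel can move it to balance overbooking risk against unnecessary interventions on bookings that would not have canceled.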
# Summarizing the performance of the best model
best_model_performance = decision_tree_metrics_df['Decision Tree']
print("\nModel Performance Summary:")
print(f"Train Accuracy: {best_model_performance['Train Accuracy']:.2f}")
print(f"Test Accuracy: {best_model_performance['Test Accuracy']:.2f}")
print(f"Train Precision: {best_model_performance['Train Precision']:.2f}")
print(f"Test Precision: {best_model_performance['Test Precision']:.2f}")
print(f"Train Recall: {best_model_performance['Train Recall']:.2f}")
print(f"Test Recall: {best_model_performance['Test Recall']:.2f}")
print(f"Train F1 Score: {best_model_performance['Train F1 Score']:.2f}")
print(f"Test F1 Score: {best_model_performance['Test F1 Score']:.2f}")
print(f"Train ROC AUC: {best_model_performance['Train ROC AUC']:.2f}")
print(f"Test ROC AUC: {best_model_performance['Test ROC AUC']:.2f}")
# Conclusions
print("\nConclusions:")
print("1. The decision tree model shows high accuracy on both the training and test sets, indicating good generalization.")
print("2. The recall scores are particularly high, suggesting that the model is effective in identifying positive cases.")
print("3. The ROC AUC score is also high, indicating a strong ability to distinguish between the classes.")
print("4. Overall, the decision tree model performs well, but it is important to monitor for potential overfitting.")
print("5. Further tuning of hyperparameters and possibly pruning the tree could help in improving the model performance.")
Model Performance Summary:
Train Accuracy: 0.77
Test Accuracy: 0.78
Train Precision: 0.77
Test Precision: 0.78
Train Recall: 0.94
Test Recall: 0.94
Train F1 Score: 0.85
Test F1 Score: 0.85
Train ROC AUC: 0.82
Test ROC AUC: 0.82

Conclusions:
1. The decision tree model shows high accuracy on both the training and test sets, indicating good generalization.
2. The recall scores are particularly high, suggesting that the model is effective in identifying positive cases.
3. The ROC AUC score is also high, indicating a strong ability to distinguish between the classes.
4. Overall, the decision tree model performs well, but it is important to monitor for potential overfitting.
5. Further tuning of hyperparameters and possibly pruning the tree could help in improving the model performance.
Based on the model performance and data analysis, here are some insights and recommendations:
1. Implement Tiered Refund Policies:
Non-Refundable Rates: Offer a lower rate for bookings that are non-refundable. This attracts price-sensitive customers while securing revenue even if cancellations occur.
Partial Refunds: Implement a tiered refund policy where the refund amount decreases as the check-in date approaches. For example:
90% refund if canceled more than 30 days before check-in.
50% refund if canceled 15-30 days before check-in.
25% refund if canceled 7-14 days before check-in.
No refund if canceled within 7 days of check-in.
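As a sketch, the tiered schedule above can be encoded as a simple lookup. The helper below is hypothetical: the tier boundaries and rates are the policy numbers listed above, not outputs of the model.

```python
def refund_fraction(days_before_checkin: int) -> float:
    """Refund rate under the illustrative tiered policy described above."""
    if days_before_checkin > 30:
        return 0.90   # more than 30 days before check-in
    elif days_before_checkin >= 15:
        return 0.50   # 15-30 days before check-in
    elif days_before_checkin >= 7:
        return 0.25   # 7-14 days before check-in
    return 0.0        # within 7 days of check-in

# Example: a 200-euro booking canceled 20 days before check-in refunds 100.0
print(refund_fraction(20) * 200)
```

Keeping the schedule in one function makes it easy to revise the tiers as the cancellation model's insights evolve.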
2. Offer Travel Insurance: Partner with travel insurance companies to offer optional travel insurance to customers. This can cover cancellations due to unforeseen circumstances, reducing the hotel's liability while giving customers peace of mind.
3. Provide Flexible Booking Options: Introduce flexible booking options at a higher rate, allowing customers to cancel or modify their reservations without penalty up to a certain period before check-in. This caters to customers who value flexibility and are willing to pay a premium for it.
4. Incentivize Rebooking: Offer incentives such as discounts or free amenities for customers who choose to rebook instead of canceling their reservation. This helps retain customers and maintain revenue streams.
5. Strengthen Loyalty Programs: Offer exclusive benefits such as room upgrades, complimentary services, and special discounts for repeat customers. This can increase customer retention and encourage repeat bookings.
6. Personalize the Customer Experience: Use customer data to personalize the guest experience. For example, offer tailored packages based on previous stay preferences, send personalized messages or offers for special occasions, and ensure that special requests are noted and fulfilled.
7. Optimize Room Pricing: Implement dynamic pricing strategies to adjust room rates based on demand, seasonality, and occupancy levels. Use data analytics to forecast demand and optimize pricing to maximize revenue.
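For illustration only, a crude occupancy-based adjustment could look like the sketch below. The multipliers and occupancy bands are invented placeholders; a production system would also fold in seasonality and demand forecasts as described.

```python
def dynamic_rate(base_rate: float, occupancy: float) -> float:
    """Adjust the nightly rate by occupancy band (illustrative numbers only)."""
    if occupancy >= 0.90:      # near-full: charge a premium
        return base_rate * 1.25
    elif occupancy >= 0.70:    # healthy demand: modest uplift
        return base_rate * 1.10
    elif occupancy <= 0.40:    # weak demand: discount to fill rooms
        return base_rate * 0.85
    return base_rate           # otherwise keep the base rate

# Example: a 100.00 base rate at 95% occupancy becomes 125.0
print(dynamic_rate(100.0, 0.95))
```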
8. Improve Online Presence and Booking Experience: Enhance the hotel's website and mobile app to provide a seamless, user-friendly booking experience. Ensure the booking process is simple, fast, and secure, and optimize the hotel's presence on online travel agencies (OTAs) while maintaining high ratings and positive reviews.
9. Invest in Staff Training: Continuously train staff to provide exceptional customer service. Well-trained staff can improve guest satisfaction, handle cancellations and refunds more effectively, and encourage positive reviews and repeat business.
10. Adopt Sustainable Practices: Adopt environmentally friendly practices such as reducing energy consumption, minimizing waste, and using eco-friendly products, and promote these initiatives to attract environmentally conscious travelers.
11. Leverage Technology for Efficiency: Invest in technology solutions such as property management systems (PMS), customer relationship management (CRM) systems, and artificial intelligence (AI) for predictive analytics. These tools can streamline operations, enhance customer engagement, and provide valuable insights for decision-making.
12. Offer Unique Packages and Experiences: Create packages that combine accommodation with local experiences, such as guided tours, culinary classes, or wellness retreats. This differentiates the hotel from competitors and attracts guests looking for more than just a place to stay.
13. Monitor and Respond to Feedback: Actively monitor online reviews and customer feedback to identify areas for improvement. Respond promptly and professionally to both positive and negative reviews to show that the hotel values its guests' opinions.
14. Expand Marketing Efforts: Use targeted marketing campaigns to reach potential customers through channels such as social media, email newsletters, and partnerships with travel influencers. Highlight unique selling points and promotions to attract a wider audience.
By implementing these profitable policies for cancellations and refunds, along with the additional recommendations, the hotel can enhance its revenue, improve customer satisfaction, and maintain a competitive edge in the market.